Update README.md with latest performance.
README.md (changed):

@@ -14,7 +14,7 @@ tags:

# Kwaipilot OASIS-1.5B

## Model Details

**Model Name**: OASIS (Order-Augmented Strategy for Improved Code Search)

**Introduction**

@@ -28,24 +28,19 @@ This model is ideal for developers and researchers engaged in enhancing **code r

OASIS was trained on a synthetic dataset created through repository-level analysis, ensuring a broad understanding across different coding styles and languages. It has demonstrated state-of-the-art performance on the latest code search benchmarks.

-Kwaipilot's upcoming initiatives include:
-- Open-sourcing improved models.
-- Releasing technical reports.
-- Releasing natural language processing models.
-- ...
+Our preprint is now available: [OASIS-arxiv](https://arxiv.org/abs/2503.08161).

## Performance

| Model | Size | CoSQA | AdvTest | CSN-Py | CSN-Ja | CSN-JS | CSN-PHP | CSN-Go | CSN-Ruby | Avg |
|-------|:----:|:-----:|:-------:|:------:|:------:|:------:|:-------:|:------:|:--------:|:---:|
| OpenAI-Embedding-Ada-002 | Unknown | 0.4423 | 0.3808 | 0.6802 | 0.7149 | 0.6750 | 0.6062 | 0.8563 | 0.7472 | 0.6378 |
| OpenAI-Text-embedding-3-large | Unknown | 0.5538 | 0.4684 | 0.7084 | 0.7292 | 0.6813 | 0.5959 | 0.8764 | 0.7525 | 0.6707 |
| jina-embeddings-v2-base-code | 161M | **0.6837** | 0.3850 | 0.6634 | 0.6803 | 0.6304 | 0.5701 | 0.8595 | 0.7095 | 0.6477 |
| CodeSage-large | 1.3B | 0.4753 | 0.5267 | 0.7077 | 0.7021 | 0.6950 | 0.6133 | 0.8371 | 0.7192 | 0.6595 |
| CodeFuse-CGE-Small | 3.8B | 0.5619 | 0.4639 | 0.6958 | 0.6863 | 0.6564 | 0.6133 | 0.8637 | 0.7341 | 0.6594 |
| OASIS-code-1.5B | 1.5B | 0.5577 | **0.5727** | **0.7369** | **0.7397** | **0.6980** | **0.6384** | **0.8821** | **0.7547** | **0.6975** |
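
The table does not state the metric; CoSQA, AdvTest, and the CodeSearchNet (CSN) splits are conventionally scored with Mean Reciprocal Rank (MRR), so these numbers are plausibly MRR values. As a minimal sketch of that assumed protocol (one relevant snippet per query, L2-normalized embeddings as produced in the usage examples below):

```python
import numpy as np

def mean_reciprocal_rank(query_embs: np.ndarray, code_embs: np.ndarray) -> float:
    """MRR when the only relevant snippet for query i is code_embs[i].

    Both arrays must be L2-normalized, shape (num_queries, dim)."""
    sims = query_embs @ code_embs.T        # cosine similarity matrix
    gold = np.diag(sims)[:, None]          # each query's true-pair score
    ranks = (sims >= gold).sum(axis=1)     # 1 means the true snippet ranks first
    return float((1.0 / ranks).mean())
```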

## Usage

@@ -96,20 +91,20 @@ code2 = """def quick_sort(arr):

```python
    less = [x for x in arr[1:] if x <= pivot]
    greater = [x for x in arr[1:] if x > pivot]
    return quick_sort(less) + [pivot] + quick_sort(greater)"""

model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.5B", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.5B")

# Tokenize and inference
inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=1024, padding=True, truncation=True, return_tensors='pt')
outputs = model(**inputs)
# Last token pooling
embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
print(embeddings.shape)
# torch.Size([3, 1536])
embeddings = F.normalize(embeddings, dim=1, p=2)
similarity = embeddings @ embeddings.T
print(similarity[0, 1:])
# tensor([0.6895, 0.8240])
```
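
The snippet above relies on two helpers, `get_query_prompt` and `last_token_pool`, plus the `AutoModel`/`AutoTokenizer`/`F` imports, all defined in an earlier part of the README that this diff leaves unchanged. For orientation, here is a minimal sketch of typical implementations; the prompt wording in `get_query_prompt` is an assumption for illustration, not necessarily the template OASIS was trained with:

```python
import torch
from torch import Tensor
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def get_query_prompt(query: str) -> str:
    # Hypothetical stand-in: natural-language queries get an instruction prefix
    # before embedding, while code snippets are embedded as-is.
    return f"Instruct: Given a code search query, retrieve the most relevant code snippet.\nQuery: {query}"

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Standard last-token pooling for decoder-style embedding models: take the
    # hidden state of each sequence's final non-padding token.
    left_padded = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padded:
        return last_hidden_states[:, -1]
    seq_lens = attention_mask.sum(dim=1) - 1
    batch = torch.arange(last_hidden_states.shape[0], device=last_hidden_states.device)
    return last_hidden_states[batch, seq_lens]
```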

### Sentence Transformers
First install the Sentence Transformers library:

@@ -120,7 +115,7 @@ Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Kwaipilot/OASIS-code-1.5B")  # optionally: model_kwargs={"torch_dtype": torch.bfloat16}
query = "How to do quicksort in python?"
code1 = """def bubble_sort(arr):
    n = len(arr)

# (diff hunk @@ -145,12 +140,12 @@; the intervening lines, the rest of code1 and the code2 definition, are unchanged)

query_embedding = model.encode([query], prompt_name="query")
code_embeddings = model.encode([code1, code2])
print(code_embeddings.shape)
# (2, 1536)

# Get the similarity scores for the embeddings
print(model.similarity(query_embedding[0], code_embeddings[0]))
print(model.similarity(query_embedding[0], code_embeddings[1]))
# tensor([[0.6895]])
# tensor([[0.8240]])
```
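
For ranking more than a handful of snippets, the same embeddings plug into Sentence Transformers' built-in semantic search helper (`sentence_transformers.util.semantic_search`). A minimal sketch reusing the objects above; the two-snippet corpus is only illustrative:

```python
from sentence_transformers import util

corpus = [code1, code2]                  # in practice: every indexed code snippet
corpus_embeddings = model.encode(corpus)
query_embeddings = model.encode([query], prompt_name="query")

# Cosine-similarity top-k search; returns a list of hit lists, one per query
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(hit["corpus_id"], round(hit["score"], 4))
```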

### BibTeX
```bibtex