---
base_model:
- Snowflake/snowflake-arctic-embed-m-long
---

# CodeRankEmbed

`CodeRankEmbed` is a 137M-parameter bi-encoder for code retrieval that supports an 8192-token context length. It significantly outperforms open-source and proprietary code embedding models across a range of code retrieval tasks.

# Performance Benchmarks

| Name | Parameters | CSN | CoIR |
| :------------------: | :--------- | :------- | :------: |
| **CodeRankEmbed** | 137M | **77.9** | **60.1** |
| CodeSage-Large | 1.3B | 71.2 | 59.4 |
| Jina-Code-v2 | 161M | 67.2 | 58.4 |
| CodeT5+ | 110M | 74.2 | 45.9 |
| Voyage-Code-002 | Unknown | 68.5 | 56.3 |

# Usage

**Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code: "

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cornstack/CodeRankEmbed", trust_remote_code=True)

# Queries must carry the task instruction prefix; code snippets are encoded as-is.
queries = ['Represent this query for searching relevant code: Calculate the n-th Fibonacci number']
codes = ["""def func(n):
    if n <= 0:
        return "Input should be a positive integer"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n):
            a, b = b, a + b
        return b
"""]

query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
```
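Since the model is a bi-encoder, queries and code are embedded independently and retrieval reduces to ranking snippets by cosine similarity against the query embedding. A minimal sketch of that ranking step with numpy; the tiny 4-dimensional dummy vectors here are stand-ins for real model outputs, not actual `CodeRankEmbed` embeddings:

```python
import numpy as np

def cosine_rank(query_emb, code_embs):
    """Rank code embeddings by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity per snippet
    order = np.argsort(-scores)    # best match first
    return order, scores

# Dummy low-dimensional vectors in place of real embeddings.
query = np.array([1.0, 0.0, 1.0, 0.0])
codes = np.array([
    [1.0, 0.1, 0.9, 0.0],  # close to the query direction
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
])

order, scores = cosine_rank(query, codes)
print(order[0])  # index of the best-matching snippet
```

In practice you would pass `query_embeddings[0]` and `code_embeddings` from the usage example above in place of the dummy arrays.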