Update README.md
README.md CHANGED
@@ -20,6 +20,8 @@ base_model:
| Voyage-Code-002 | Unknown | 68.5 | 56.3 |


+We release the scripts to evaluate our model's performance [here](https://github.com/gangiswag/cornstack).
+
# Usage

**Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code"
@@ -47,3 +49,9 @@ print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
```
+
+
+
+## Training
+We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoder. The retriever is contrastively fine-tuned with InfoNCE loss on a high-quality dataset we curated called [CoRNStack](https://gangiswag.github.io/cornstack/). Our encoder is initialized with [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M parameter text encoder supporting an extended context length of 8,192 tokens.
+
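The usage hunk above shows only the tail of the README's snippet. A minimal end-to-end sketch (not the card's verbatim code) is below; it assumes the model loads through `sentence_transformers` with `trust_remote_code=True`, that the checkpoint id is `nomic-ai/CodeRankEmbed`, and that the task instruction prefix is joined to the query with a colon and space. Substitute the actual id and prompt format from this card if they differ.

```python
# Hedged sketch: the model id, prefix separator, and example data are assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/CodeRankEmbed", trust_remote_code=True)  # assumed checkpoint id

# Queries must carry the task instruction prefix; code passages are encoded as-is.
QUERY_PREFIX = "Represent this query for searching relevant code: "

queries = ["how to compute the factorial of a number"]
codes = ["def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)"]

query_embeddings = model.encode([QUERY_PREFIX + q for q in queries])
code_embeddings = model.encode(codes)

print(query_embeddings.shape, code_embeddings.shape)
```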
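The Training paragraph added above describes contrastive fine-tuning of a shared-weight bi-encoder with InfoNCE. The fragment below is only an illustration of in-batch InfoNCE, not the authors' training code: the temperature value, the restriction to in-batch negatives, and the `encoder` placeholder are assumptions.

```python
# Illustrative in-batch InfoNCE for a shared-weight bi-encoder (not the authors' code).
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, code_emb: torch.Tensor, temperature: float = 0.05):
    """The i-th query's positive is the i-th code; every other code in the batch is a negative."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Weight sharing falls out of using the same encoder module for both sides:
#   loss = info_nce_loss(encoder(query_batch), encoder(code_batch))
# where `encoder` stands in for the Arctic-Embed-M-Long backbone mentioned above (placeholder here).
```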