Update README.md
README.md CHANGED
@@ -13,34 +13,19 @@ tags:
 
 # ModernColBERT + InSeNT
 
-
-
-
+[arXiv](https://arxiv.org/abs/2505.24782)
+[GitHub](https://github.com/illuin-tech/contextual-embeddings)
+[Hugging Face](https://huggingface.co/illuin-conteb)
 
-
-- **Model Type:** Sentence Transformer
-- **Base model:** [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1)
-- **Maximum Sequence Length:** 8192 tokens
-- **Output Dimensionality:** 128 dimensions
-- **Similarity Function:** MaxSim
-- **Training Dataset:**
-  - train
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
+<img src="https://cdn-uploads.huggingface.co/production/uploads/60f2e021adf471cbdf8bb660/jq_zYRy23bOZ9qey3VY4v.png" width="800">
 
-### Model Sources
 
-
-- **Hugging Face:** [Contextual Embeddings](https://huggingface.co/illuin-conteb)
+This is a contextual model finetuned from [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) on the ConTEB training set, using the InSeNT training approach detailed in the corresponding paper.
 
-
+> [!WARNING]
+> This experimental model stems from the paper [*Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings*](https://arxiv.org/abs/2505.24782).
+> While results are promising, we have observed regressions on standard embedding tasks; using it in production will likely require further work on extending the training set to improve robustness and out-of-distribution (OOD) generalization.
 
-```
-ColBERT(
-  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
-  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
-)
-```
 
 ## Usage
 
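The body of the `## Usage` section is unchanged by this commit and therefore collapsed here; only its closing lines surface as context in the next hunk (note the `embeddings[0][0]` indexing, which suggests per-document lists of chunk embeddings). As a rough sketch of how a ColBERT-style checkpoint like this one is typically loaded and queried with the PyLate library — the model id and the plain `encode` calls below are assumptions, not this card's actual snippet:

```python
# Illustrative sketch only: standard PyLate late-interaction encoding.
from pylate import models

# Hypothetical model id — substitute this repository's actual path.
model = models.ColBERT(model_name_or_path="illuin-conteb/modern-colbert-insent")

# Queries and documents become multi-vector embeddings:
# one 128-dim vector per token rather than a single pooled vector.
query_embeddings = model.encode(
    ["What is InSeNT training?"],
    is_query=True,
)
document_embeddings = model.encode(
    ["InSeNT finetunes a ColBERT model so that chunk embeddings absorb document-level context."],
    is_query=False,
)

print(document_embeddings[0].shape)  # (num_tokens, 128)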
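```

The Model Details re-added in the next hunk list MaxSim as the similarity function. For readers new to late interaction, here is a minimal PyTorch sketch (not this repository's code) of the MaxSim score: every query token vector is matched against its best document token vector, and the per-token maxima are summed.

```python
import torch

def maxsim(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim late-interaction score between one query and one document.

    query_emb: (num_query_tokens, 128) token embeddings.
    doc_emb:   (num_doc_tokens, 128) token embeddings.
    """
    # Similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    # Each query token keeps its best-matching document token,
    # and the document's score is the sum of those maxima.
    return sim.max(dim=1).values.sum()

# Toy example with the card's 128-dim output dimensionality.
q = torch.randn(16, 128)
d = torch.randn(300, 128)
print(maxsim(q, d))  # scalar relevance score
```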
@@ -81,10 +66,35 @@ print(f"Shape of first chunk embedding: {embeddings[0][0].shape}") # torch.Size(
 ```
 
 
+## Model Details
 
-
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Base model:** [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1)
+- **Maximum Sequence Length:** 8192 tokens
+- **Output Dimensionality:** 128 dimensions
+- **Similarity Function:** MaxSim
+- **Training Dataset:**
+  - train
+<!-- - **Language:** Unknown -->
+<!-- - **License:** Unknown -->
+
+### Model Sources
+
+- **Repository:** [Contextual Embeddings](https://github.com/illuin-tech/contextual-embeddings)
+- **Hugging Face:** [Contextual Embeddings](https://huggingface.co/illuin-conteb)
+
+### Full Model Architecture
 
-
+```
+ColBERT(
+  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
+  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
+)
+```
+
+
+## Citation
 
 ```bibtex
 @misc{conti2025contextgoldgoldpassage,