---
language:
- ja
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
base_model: cl-nagoya/ruri-pt-large
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
---
# Ruri: Japanese General Text Embeddings
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-large")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
# convert_to_tensor=True returns a torch.Tensor (instead of a NumPy array),
# so the broadcasting below works.
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([3, 1024])

# Broadcast to compute all pairwise cosine similarities;
# dim=-1 reduces over the embedding dimension.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=-1)
print(similarities.shape)
# torch.Size([3, 3])
```
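The broadcasting trick used above does not require the model itself; a minimal self-contained sketch with random stand-in embeddings (illustrative only, the tensors here are not real model outputs):

```python
import torch
import torch.nn.functional as F

# Stand-in for model.encode(...): 3 embeddings of dimension 1024.
embeddings = torch.randn(3, 1024)

# unsqueeze(0) -> [1, 3, 1024], unsqueeze(1) -> [3, 1, 1024];
# broadcasting yields a [3, 3] matrix of pairwise cosine similarities.
similarities = F.cosine_similarity(
    embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=-1
)
print(similarities.shape)  # torch.Size([3, 3])

# The diagonal holds each vector's similarity with itself, which is 1.
print(torch.allclose(similarities.diagonal(), torch.ones(3), atol=1e-5))
```

Entry (i, j) of the matrix is the cosine similarity between sentence i and sentence j, so semantically related pairs score higher than unrelated ones.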
## Benchmarks
### JMTEB
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
|Model|#Param.|Retrieval|STS|Classification|Reranking|Clustering|PairClassification|Avg.|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|[cl-nagoya/sup-simcse-ja-base](https://huggingface.co/cl-nagoya/sup-simcse-ja-base)|111M|49.64|82.05|73.47|91.83|51.79|62.57|68.56|
|[cl-nagoya/sup-simcse-ja-large](https://huggingface.co/cl-nagoya/sup-simcse-ja-large)|337M|37.62|83.18|73.73|91.48|50.56|62.51|66.51|
|[cl-nagoya/unsup-simcse-ja-base](https://huggingface.co/cl-nagoya/unsup-simcse-ja-base)|111M|40.23|78.72|73.07|91.16|44.77|62.44|65.07|
|[cl-nagoya/unsup-simcse-ja-large](https://huggingface.co/cl-nagoya/unsup-simcse-ja-large)|337M|40.53|80.56|74.66|90.95|48.41|62.49|66.27|
|[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja)|133M|59.02|78.71|76.82|91.90|49.78|66.39|70.44|
||||||||||
|[sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)|472M|40.12|76.56|72.66|91.63|44.88|62.33|64.70|
|[intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)|118M|67.27|80.07|67.62|93.03|46.91|62.19|69.52|
|[intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)|278M|68.21|79.84|69.30|92.85|48.26|62.26|70.12|
|[intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)|560M|70.98|79.70|72.89|92.96|51.24|62.15|71.65|
||||||||||
|OpenAI/text-embedding-ada-002|-|64.38|79.02|69.75|93.04|48.30|62.40|69.48|
|OpenAI/text-embedding-3-small|-|66.39|79.46|73.06|92.92|51.06|62.27|70.86|
|OpenAI/text-embedding-3-large|-|74.48|82.52|77.58|93.58|53.32|62.35|73.97|
||||||||||
|[Ruri-Small](https://huggingface.co/cl-nagoya/ruri-small)|68M|69.41|82.79|76.22|93.00|51.19|62.11|71.53|
|[Ruri-Base](https://huggingface.co/cl-nagoya/ruri-base)|111M|69.82|82.87|75.58|92.91|54.16|62.38|71.91|
|[Ruri-Large](https://huggingface.co/cl-nagoya/ruri-large)|337M|73.02|83.13|77.43|92.99|51.82|62.29|73.31|
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [cl-nagoya/ruri-pt-large](https://huggingface.co/cl-nagoya/ruri-pt-large)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** Apache 2.0
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
MySentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
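The `Pooling` layer above is configured for masked mean pooling (`'pooling_mode_mean_tokens': True`): token embeddings are averaged while padding positions are excluded via the attention mask. A minimal sketch of that operation (illustrative, not the library's internal implementation):

```python
import torch


def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()       # [batch, seq, 1]
    summed = (token_embeddings * mask).sum(dim=1)     # [batch, dim]
    counts = mask.sum(dim=1).clamp(min=1e-9)          # [batch, 1], avoid div by 0
    return summed / counts


# Toy example: batch of 1, sequence length 4 with the last token padded.
tokens = torch.arange(4 * 3, dtype=torch.float32).reshape(1, 4, 3)
mask = torch.tensor([[1, 1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)  # torch.Size([1, 3])
```

Only the three unmasked token vectors contribute to the average, which is what makes the pooled embedding insensitive to padding length.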
## Training Details
### Framework Versions
- Python: 3.10.13
- Sentence Transformers: 3.0.0
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu118
- Accelerate: 0.30.1
- Datasets: 2.19.1
- Tokenizers: 0.19.1
## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).