---
language:
- ja
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
base_model: cl-nagoya/ruri-pt-large
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
---

# Ruri: Japanese General Text Embeddings


## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-large")

sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]
# Return a torch tensor (the default is a NumPy array, which has no `unsqueeze`).
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([3, 1024])

# Pairwise cosine similarities, reduced over the embedding dimension.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=-1)
print(similarities.shape)
# torch.Size([3, 3])
```
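
Sentence Transformers v3.0 (the version listed under Framework Versions below) also ships a built-in `similarity` helper that applies the model's configured similarity function, cosine similarity for this model, so the manual `F.cosine_similarity` call above can be replaced:

```python
# Equivalent, using the model's configured similarity function (cosine).
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```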

## Benchmarks

### JMTEB
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).

|Model|#Param.|Retrieval|STS|Classification|Reranking|Clustering|PairClassification|Avg.|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|[cl-nagoya/sup-simcse-ja-base](https://huggingface.co/cl-nagoya/sup-simcse-ja-base)|111M|49.64|82.05|73.47|91.83|51.79|62.57|68.56|
|[cl-nagoya/sup-simcse-ja-large](https://huggingface.co/cl-nagoya/sup-simcse-ja-large)|337M|37.62|83.18|73.73|91.48|50.56|62.51|66.51|
|[cl-nagoya/unsup-simcse-ja-base](https://huggingface.co/cl-nagoya/unsup-simcse-ja-base)|111M|40.23|78.72|73.07|91.16|44.77|62.44|65.07|
|[cl-nagoya/unsup-simcse-ja-large](https://huggingface.co/cl-nagoya/unsup-simcse-ja-large)|337M|40.53|80.56|74.66|90.95|48.41|62.49|66.27|
|[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja)|133M|59.02|78.71|76.82|91.90|49.78|66.39|70.44|
||||||||||
|[sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)|472M|40.12|76.56|72.66|91.63|44.88|62.33|64.70|
|[intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)|118M|67.27|80.07|67.62|93.03|46.91|62.19|69.52|
|[intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)|278M|68.21|79.84|69.30|92.85|48.26|62.26|70.12|
|[intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)|560M|70.98|79.70|72.89|92.96|51.24|62.15|71.65|
||||||||||
|OpenAI/text-embedding-ada-002|-|64.38|79.02|69.75|93.04|48.30|62.40|69.48|
|OpenAI/text-embedding-3-small|-|66.39|79.46|73.06|92.92|51.06|62.27|70.86|
|OpenAI/text-embedding-3-large|-|74.48|82.52|77.58|93.58|53.32|62.35|73.97|
||||||||||
|[Ruri-Small](https://huggingface.co/cl-nagoya/ruri-small)|68M|69.41|82.79|76.22|93.00|51.19|62.11|71.53|
|[Ruri-Base](https://huggingface.co/cl-nagoya/ruri-base)|111M|69.82|82.87|75.58|92.91|54.16|62.38|71.91|
|[Ruri-Large](https://huggingface.co/cl-nagoya/ruri-large)|337M|73.02|83.13|77.43|92.99|51.82|62.29|73.31|



## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [cl-nagoya/ruri-pt-large](https://huggingface.co/cl-nagoya/ruri-pt-large)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** Apache 2.0
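
These properties can also be read directly off the loaded model; a minimal sketch using the Sentence Transformers API (attribute names as of v3.0):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cl-nagoya/ruri-large")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 1024
print(model.similarity_fn_name)                  # "cosine"
```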

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
MySentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
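
The pooling layer above applies mean pooling over token embeddings (`pooling_mode_mean_tokens: True`). For reference, a minimal sketch of the equivalent computation with the raw `transformers` API; whether the tokenizer needs extra dependencies (e.g. `fugashi`, common for Japanese BERT tokenizers) is an assumption not stated in this card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-large")
model = AutoModel.from_pretrained("cl-nagoya/ruri-large")

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

batch = tokenizer(
    ["今日はいい天気です。"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    output = model(**batch)

embeddings = mean_pool(output.last_hidden_state, batch["attention_mask"])
print(embeddings.shape)  # torch.Size([1, 1024])
```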


## Training Details


### Framework Versions
- Python: 3.10.13
- Sentence Transformers: 3.0.0
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu118
- Accelerate: 0.30.1
- Datasets: 2.19.1
- Tokenizers: 0.19.1


## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).