	---
	language:
	- ja
	library_name: sentence-transformers
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	base_model: cl-nagoya/ruri-pt-large
	widget: []
	pipeline_tag: sentence-similarity
	license: apache-2.0
	---

	# Ruri: Japanese General Text Embeddings


	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	import torch.nn.functional as F
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("cl-nagoya/ruri-large")

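	sentences = [
	    'The weather is lovely today.',
	    "It's so sunny outside!",
	    'He drove to the stadium.',
	]
	# convert_to_tensor=True returns a torch.Tensor instead of a NumPy array,
	# which the unsqueeze calls below require.
	embeddings = model.encode(sentences, convert_to_tensor=True)
	print(embeddings.shape)
	# [3, 1024]

	# Pairwise cosine similarity; dim=2 reduces over the embedding dimension.
	similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
	print(similarities.shape)
	# [3, 3]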
	```
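
Once embeddings are available, a common next step is to rank candidate texts against a query by cosine similarity. The following is a minimal sketch of that step using NumPy only. The helper name `rank_by_cosine` and the small stand-in vectors are hypothetical and chosen for illustration; in practice the arrays would come from `model.encode(...)`.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k documents by cosine similarity.

    Hypothetical helper for illustration; not part of sentence-transformers.
    """
    query = np.asarray(query_vec, dtype=np.float64)
    docs = np.asarray(doc_vecs, dtype=np.float64)

    # L2-normalize so that a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    scores = docs @ query
    # Sort by descending score and keep the best top_k entries.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Stand-in 4-dimensional vectors; real Ruri-Large embeddings have 1024 dimensions.
doc_vecs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query_vec = np.array([1.0, 0.05, 0.0, 0.0])

ranking = rank_by_cosine(query_vec, doc_vecs, top_k=2)
print(ranking)
```

Normalizing once and using a matrix product keeps the ranking step cheap when the same document set is queried repeatedly.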

	## Benchmarks

	### JMTEB
	Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).

	|Model|#Param.|Retrieval|STS|Classification|Reranking|Clustering|PairClassification|Avg.|
	|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
	|[cl-nagoya/sup-simcse-ja-base](https://huggingface.co/cl-nagoya/sup-simcse-ja-base)|111M|49.64|82.05|73.47|91.83|51.79|62.57|68.56|
	|[cl-nagoya/sup-simcse-ja-large](https://huggingface.co/cl-nagoya/sup-simcse-ja-large)|337M|37.62|83.18|73.73|91.48|50.56|62.51|66.51|
	|[cl-nagoya/unsup-simcse-ja-base](https://huggingface.co/cl-nagoya/unsup-simcse-ja-base)|111M|40.23|78.72|73.07|91.16|44.77|62.44|65.07|
	|[cl-nagoya/unsup-simcse-ja-large](https://huggingface.co/cl-nagoya/unsup-simcse-ja-large)|337M|40.53|80.56|74.66|90.95|48.41|62.49|66.27|
	|[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja)|133M|59.02|78.71|76.82|91.90|49.78|66.39|70.44|
	||||||||||
	|[sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)|472M|40.12|76.56|72.66|91.63|44.88|62.33|64.70|
	|[intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)|118M|67.27|80.07|67.62|93.03|46.91|62.19|69.52|
	|[intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)|278M|68.21|79.84|69.30|92.85|48.26|62.26|70.12|
	|[intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)|560M|70.98|79.70|72.89|92.96|51.24|62.15|71.65|
	||||||||||
	|OpenAI/text-embedding-ada-002|-|64.38|79.02|69.75|93.04|48.30|62.40|69.48|
	|OpenAI/text-embedding-3-small|-|66.39|79.46|73.06|92.92|51.06|62.27|70.86|
	|OpenAI/text-embedding-3-large|-|74.48|82.52|77.58|93.58|53.32|62.35|73.97|
	||||||||||
	|[Ruri-Small](https://huggingface.co/cl-nagoya/ruri-small)|68M|69.41|82.79|76.22|93.00|51.19|62.11|71.53|
	|[Ruri-Base](https://huggingface.co/cl-nagoya/ruri-base)|111M|69.82|82.87|75.58|92.91|54.16|62.38|71.91|
	|[Ruri-Large](https://huggingface.co/cl-nagoya/ruri-large)|337M|73.02|83.13|77.43|92.99|51.82|62.29|73.31|



	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Base model: [cl-nagoya/ruri-pt-large](https://huggingface.co/cl-nagoya/ruri-pt-large)
	- Maximum Sequence Length: 512 tokens
	- Output Dimensionality: 1024
	- Similarity Function: Cosine Similarity
	- Language: Japanese
	- License: Apache 2.0
	<!-- - Training Dataset: Unknown -->

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
	- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

	### Full Model Architecture

	```
	MySentenceTransformer(
	  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
	  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	)
	```
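
The `Pooling` module above has only `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of the token embeddings, with padding positions typically excluded via the attention mask. A minimal NumPy sketch of that idea follows. The function name `masked_mean_pool` and the toy shapes are assumptions for illustration; this is not the library's implementation.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    Hypothetical helper; a sketch of mean pooling, not the library code.
    """
    tokens = np.asarray(token_embeddings, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[:, :, None]

    # Zero out padding positions, then sum over the sequence axis.
    summed = (tokens * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2 (the real model uses 1024).
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)
```

Masking before averaging keeps padded batches from shifting the embedding of shorter inputs.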


	## Training Details


	### Framework Versions
	- Python: 3.10.13
	- Sentence Transformers: 3.0.0
	- Transformers: 4.41.2
	- PyTorch: 2.3.1+cu118
	- Accelerate: 0.30.1
	- Datasets: 2.19.1
	- Tokenizers: 0.19.1
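
To compare a local environment against the versions listed above, something like the following standard-library sketch could be used. The names `LISTED_VERSIONS` and `compare_versions` are hypothetical, and the keys are assumed to be the PyPI distribution names of the listed packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this README, keyed by (assumed) PyPI distribution name.
LISTED_VERSIONS = {
    "sentence-transformers": "3.0.0",
    "transformers": "4.41.2",
    "torch": "2.3.1+cu118",
    "accelerate": "0.30.1",
    "datasets": "2.19.1",
    "tokenizers": "0.19.1",
}


def compare_versions(listed):
    """Map each package to (installed_version_or_None, listed_version)."""
    report = {}
    for name, listed_version in listed.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            installed = None
        report[name] = (installed, listed_version)
    return report


for name, (installed, listed_version) in compare_versions(LISTED_VERSIONS).items():
    status = "match" if installed == listed_version else "differs"
    print(f"{name}: installed={installed}, listed={listed_version} ({status})")
```

A mismatch is not necessarily a problem; the report only shows where the local setup differs from the one recorded here.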

	<!-- ## Citation

	### BibTeX
	-->

	## License
	This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).