SentenceTransformer based on Lajavaness/bilingual-embedding-large
This is a sentence-transformers model fine-tuned from Lajavaness/bilingual-embedding-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Lajavaness/bilingual-embedding-large
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
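Modules (1) and (2) turn per-token vectors into a single unit-length sentence vector. The following is a minimal pure-Python sketch of mean pooling followed by L2 normalization, assuming the token vectors and attention mask are already available. The function names `mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are illustrative and are not part of the library.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy input: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)          # mean of the two unmasked vectors
sentence_vector = l2_normalize(pooled)    # same direction, length 1
print(pooled, sentence_vector)
```

Because the output is normalized, the cosine similarity between two sentence vectors equals their dot product.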
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Even if you remove JIMENEZ ....... IF THERE IS STILL A SMARTMATIC THAT IS GOOD AT MAGIC THERE IS ALSO NO ..... SMARTMATIC SHOULD BE REMOVED FROM THE COMELEC CONTRACT ..... BECAUSE THE DEMON COMELEC HAS LONG HONORED THE VOTE OF MANY PEOPLE ..... AS LONG AS THERE ARE COMMISSIONERS IN THE COMELEC WHO LOOK LIKE MONEY, WE WILL NOT HAVE A CLEAN ELECTION ....... JUST IMAGINE HOW LONG THE ISSUE SPREADS THAT IF A CANDIDATE WANTS TO WIN, IT WILL PAY THE COMELEC 25 MILLION ???????????????????????????? ? SO ARE THE ELECTION RESULTS HOKOS POKOS ?????????????????????? DEMONS ...... SO ALL THE PUNISHMENT OF HEAVEN HAS BEEN GIVEN IN THE PHILIPPINES BECAUSE TANING LIVES WITH US ...... THE THOUGHT IS PURE MONEY ..... SO EVEN ELECTIONS ARE MONEY ..... ..... 7:08 AM 4G 51% FINALLY, COMELEC OFFICIAL JIMENEZ, REMOVED IN PLACE. BY PRRD AND OTHERS AGAIN THIS. FOR CLEAN NOW ELECTION TO COMING 2022 ELECTION',
    'Philippine President Rodrigo Duterte fired Comelec spokesman James Jimenez in May 2021 Posts misleadingly claim Philippine president fired poll body spokesman',
    'The WHO declared covid-19 an endemic disease Although it considers it probable, the WHO has not yet declared covid-19 an endemic disease',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
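A pairwise similarity matrix like the one above can be used to pair each text with its closest neighbour, for example a social media post with the fact-check claim it most resembles. The following is a minimal sketch; the helper name `best_matches` is hypothetical, and the 3x3 matrix holds made-up values that stand in for real model output.

```python
def best_matches(similarity_matrix):
    """For each row, return (index, score) of the most similar *other* row."""
    matches = []
    for i, row in enumerate(similarity_matrix):
        # Exclude the diagonal: every text is trivially most similar to itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        score, j = max(candidates)
        matches.append((j, score))
    return matches

# Made-up similarity values, for illustration only (not results from this model).
toy_similarities = [
    [1.00, 0.82, 0.10],
    [0.82, 1.00, 0.15],
    [0.10, 0.15, 1.00],
]
matches = best_matches(toy_similarities)
print(matches)  # [(1, 0.82), (0, 0.82), (1, 0.15)]
```

With real output, a plain nested list such as `similarities.tolist()` could be passed in, assuming `similarities` is a tensor.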
Training Details
Training Dataset
Unnamed Dataset
- Size: 21,769 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 1000 samples:

Column | Type | Min tokens | Mean tokens | Max tokens |
---|---|---|---|---|
sentence_0 | string | 6 | 120.55 | 512 |
sentence_1 | string | 14 | 38.75 | 148 |
- Samples:

sentence_0 | sentence_1 |
---|---|
"On January 1, 1979 New York billionaire Brandon Torrent allowed himself to be photographed while urinating on a homeless man sleeping on the street. This image explains, better than many words, the division of the world into social classes that we must eliminate . Meanwhile, in 21st century Brazil, many 'good citizens', just above the homeless condition, applaud politicians and politicians* who support the predatory elite represented by this abject and unworthy human being, who urinates on people who, in the final analysis, are the builders of the fortune he enjoys. Until we realize which side of this stream of urine we are on, we will not be able to build a truly just society. Class consciousness is the true and most urgent education." | This photo shows a billionaire named Brandon Torrent urinating on a homeless man The real story behind the image of a man who appears to urinate on a homeless person |
French secret service officer jean claude returns from his mission as imam with deash (isis) like others from several countries in Syria.. there are questions | This man is a French intelligence officer No, this man is not a French intelligence officer |
Oh yes! Rohit Sharma Mumbai Indians Burj Khalifa DIEL 82 SAMSUNG MUMBAI INDIANS | Dubai’s Burj Khalifa skyscraper displays photo of Indian cricketer Rohit Sharma This image of the Burj Khalifa has been doctored – the original does not show a projection of Indian cricketer Rohit Sharma |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
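MultipleNegativesRankingLoss treats the other sentence_1 entries in a batch as negatives: each sentence_0 should score highest against its own sentence_1. The following pure-Python sketch shows the idea as a cross-entropy over scaled cosine similarities, using the scale of 20.0 listed above. It is a simplified illustration on toy 2-dimensional vectors, not the library's implementation, and the function names are chosen here for the example.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional embeddings, for illustration only.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the wrong positive

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)    # near 0 when aligned, near 20 when swapped
```

In this sketch a batch of two pairs gives each anchor exactly one in-batch negative, so larger batches would supply more negatives per step.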
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 2
- per_device_eval_batch_size: 2
- num_train_epochs: 1
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 2
- per_device_eval_batch_size: 2
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | Training Loss |
---|---|---|
0.0459 | 500 | 0.0329 |
0.0919 | 1000 | 0.0296 |
0.1378 | 1500 | 0.0314 |
0.1837 | 2000 | 0.0199 |
0.2297 | 2500 | 0.0435 |
0.2756 | 3000 | 0.0213 |
0.3215 | 3500 | 0.0293 |
0.3675 | 4000 | 0.0387 |
0.4134 | 4500 | 0.0064 |
0.4593 | 5000 | 0.0338 |
0.5053 | 5500 | 0.0317 |
0.5512 | 6000 | 0.0395 |
0.5972 | 6500 | 0.0129 |
0.6431 | 7000 | 0.036 |
0.6890 | 7500 | 0.0292 |
0.7350 | 8000 | 0.0215 |
0.7809 | 8500 | 0.02 |
0.8268 | 9000 | 0.0215 |
0.8728 | 9500 | 0.0139 |
0.9187 | 10000 | 0.0273 |
0.9646 | 10500 | 0.0138 |
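
The Epoch column in the table follows from the dataset size and batch size reported in this card. The short calculation below reproduces it, assuming training ran on a single device (the card does not state the device count) with gradient_accumulation_steps of 1 and no dropped last batch; `epoch_at` is a helper name introduced for this example.

```python
import math

num_samples = 21_769   # training set size reported in this card
batch_size = 2         # per_device_train_batch_size
num_devices = 1        # assumption: the device count is not stated in the card

# The last, partial batch still counts as a step (dataloader_drop_last is False).
steps_per_epoch = math.ceil(num_samples / (batch_size * num_devices))

def epoch_at(step):
    """Fraction of the epoch completed after `step` optimizer steps."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch)                   # 10885
print(epoch_at(500), epoch_at(10500))    # 0.0459 0.9646
```

Under these assumptions one epoch is 10,885 steps, which explains why the log, written every 500 steps, ends at step 10,500.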
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}