metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:1340
- loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-base-en-v1.5
widget:
- source_sentence: Can you tell me about the origin of the word 'Shehnai'?
sentences:
- >-
Krishan Kant (28 February 1927 – 27 July 2002) was the tenth Vice
President of India from 1997 until his death. Previously, he was
Governor of Andhra Pradesh from 1990 to 1997.
- >-
Acherontia lachesis is a large (up to 13 cm wingspan) Sphingid moth
found in India and much of the Oriental region, one of the three species
of Death's-head Hawkmoth, also known as the "Bee Robber".
- >-
A Shehnai is a South Asian music instrument which is normally played at
marriages and other ceremonies, rites and rituals. The word itself is of
Muslim/Turkish origin, combining 'Sheh' (or 'Shah') 'Royal' and '-Nai'
or 'Ney', a type of Flute. A version of the "Shehnai", the "Surnai", is
also played in the Northern and North-western areas of India and
Pakistan, in particular at traditional Polo matches.
- source_sentence: How do scammers typically operate in these scams?
sentences:
- >-
The Singapore strategy was a strategy about defending the British Empire
in the Asian Far East, mainly against the Empire of Japan. The strategy
involved a number of different plans and stages, developed between 1919
and 1941. The basic idea was to base a fleet of ships in the Far East.
This fleet could then be used to stop and defeat a Japanese force
heading towards India or Australia. In 1919, Singapore was chosen
because of its strategic location at the end of the Strait of Malacca.
- >-
The Non-Cooperation Movement was a significant phase of the Indian
independence movement from British rule. It was led by Mohandas
Karamchand Gandhi after the Jallianwala Bagh Massacre. It aimed to
resist British rule in India through non-violent means or "satyagraha".
Protestors would refuse to buy British goods, adopt nihal use of local
handicrafts and picket and liquor shops. The ideas of Ahimsa and
nonviolence, and Gandhi's ability to rally hundreds of thousands of
common citizens towards the cause of Indian independence, were first
seen on a large scale in this movement through the summer 1920. Gandhi
feared that the movement might lead to popular violence. The
non-cooperation movement was launched on 12th August, 1921.
- >-
A technical support scam is a form of telephone fraud that tricks people
by pretending that they are a service which helps people fix their
computers. In most cases they convince the victim they have a computer
problem that does not actually exist. A common type is when someone gets
a call from someone (usually from places like India or Pakistan)
pretending to be from a company that sounds real such as "Microsoft" or
"Windows" support. Often the caller tries to gain the victim's trust.
They may use confusing and very technical language to sound authentic.
They may ask the victim to perform several tasks on their computer.
Often they target legitimate files on the victim's computer saying these
are viruses. These tactics are designed to scare people into letting the
scammer fix the problem (that does not really exist). The caller may
have the victim install malicious software that could capture sensitive
data, such as online banking passwords or credit card information.
- source_sentence: How is Northeast India connected to the rest of India?
sentences:
- >-
Air India Flight 182 was a passenger plane which, on June 23, 1985,
exploded from a bomb that was placed on the plane. The aircraft was
going between Montréal-Mirabel International Airport, Montreal, Quebec,
and New Delhi, India. It was an Air India Boeing 747-237B, registration
VT-EFO. The bombing was called the largest mass murder in modern
Canadian history, and the deadliest act of air terrorism before 9/11.
- >-
Hinduism is not only a religion but also a way of life. Hinduism is
widely practiced in South Asia mainly in India and Nepal. Hinduism is
the oldest religion in the world, and Hindus refer to it as "", "the
eternal tradition," or the "eternal way," beyond human history. Scholars
regard Hinduism as a combination of different Indian cultures and
traditions, with diverse roots. Hinduism has no founder and origins of
Hinduism is unknown. What we now call Hinduism have roots in cave
paintings that have been preserved from Mesolithic sites dating from c.
30,000 BCE in Bhimbetka, near present-day Bhopal, in the Vindhya
Mountains in the Madhya Pradesh." There was no concept of religion in
India and Hinduism was not a religion. Hinduism as a religion started to
develop between 500 BCE and 300 CE, after the Vedic period (1500 BCE to
500 BCE).
- >-
Various groups are involved in the Insurgency in Northeast India,
India's northeast states, which are connected to the rest of the
Republic of India by a narrow strip of land known as the Siliguri
Corridor. In the region several armed factions operate. Some groups call
for a separate state, others for regional autonomy, while some extreme
groups demand complete independence.
- source_sentence: How many songs did Rafi sing during his career?
sentences:
- "Inder Kumar Gujral (4 December 1919\_– 30 November 2012) was an Indian politician. He was the 12th Prime Minister of India from April 1997 to March 1998. Gujral was the third Prime Minister to be from the Rajya Sabha."
- >-
Mohammed Rafi (, , December 24, 1924 – July 31, 1980) was a popular
Bollywood playback singer. In a career of over 40 years, Rafi sang more
than 26,000 songs in the national languages of India and sometimes in
other languages.
- >-
The University of Calcutta (informally known as Calcutta University or
CU) is a public state university located in Kolkata (formerly
"Calcutta"), West Bengal, India. It was created on 24 January 1857.
Within India it is recognized as a "Five-Star University" and a "Centre
with Potential for Excellence" by the University Grants Commission and
the National Assessment and Accreditation Council.
- source_sentence: Who was Lal Bahadur Shastri?
sentences:
- >-
The Bharatiya Janata Party (abbreviated BJP) is one of the two major
political parties in India. (The second being the Indian National
Congress). Since the Indian elections in 2014, the BJP has 303 of the
542 seats in the Lok Sabha, the lower house of the Parliament of India
and 78 of the 238 seats in Rajya Sabha, the upper house of the
Parliament of India. Amit Shah is the national president of BJP since
2014.
- >-
Rex Vernon Whitehead (26 October 1948 – 26 June 2014) was an Australian
Test cricket match umpire and cricketer. He umpired four Test matches
between 1981 and 1982. His first match was between Australia and India
in Sydney on 2 January to 4 January 1981. Altogether, he umpired 15
first-class matches in his career between 1979 and 1983.
- "Lal Bahadur Shastri (, , 2 October 1904\_– 11 January 1966) was an Indian politician. He was the 2nd Prime Minister of India from 1964 to 1966. He was a senior leader of the Indian National Congress political party."
pipeline_tag: sentence-similarity
library_name: sentence-transformers
SentenceTransformer based on BAAI/bge-base-en-v1.5
This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-base-en-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
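The Pooling and Normalize modules listed above can be pictured with a small example. This is a minimal sketch on toy data, assuming the Transformer module has already produced one vector per token; `cls_pool_and_normalize` is a hypothetical helper name, not a library function, and the 4-dimensional vectors stand in for the real 768-dimensional ones.

```python
import math

def cls_pool_and_normalize(token_embeddings):
    """Keep the first ([CLS]) token vector and scale it to unit L2 norm."""
    cls_vector = token_embeddings[0]  # mirrors pooling_mode_cls_token=True
    norm = math.sqrt(sum(x * x for x in cls_vector))
    if norm == 0.0:
        return list(cls_vector)  # avoid dividing by zero on an all-zero vector
    return [x / norm for x in cls_vector]

# Toy 3-token sequence with 4-dimensional vectors (illustrative values only).
toy_tokens = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]
sentence_embedding = cls_pool_and_normalize(toy_tokens)
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.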
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sunjupskilling/sunj-bge-base-en-v1.5")
# Run inference
sentences = [
'Who was Lal Bahadur Shastri?',
'Lal Bahadur Shastri (, , 2 October 1904\xa0– 11 January 1966) was an Indian politician. He was the 2nd Prime Minister of India from 1964 to 1966. He was a senior leader of the Indian National Congress political party.',
'Rex Vernon Whitehead (26 October 1948 – 26 June 2014) was an Australian Test cricket match umpire and cricketer. He umpired four Test matches between 1981 and 1982. His first match was between Australia and India in Sydney on 2 January to 4 January 1981. Altogether, he umpired 15 first-class matches in his career between 1979 and 1983.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
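For semantic search, the similarity scores can be turned into a ranking of passages for a question. The following is a minimal sketch that uses toy vectors in place of `model.encode(...)` outputs so that it runs without downloading the model; `cosine_similarity` and `rank_passages` are illustrative helper names, not part of the Sentence Transformers API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for real embeddings (assumed values).
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # unrelated to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # partially related
]
ranking = rank_passages(query, passages)
print(ranking)  # [1, 2, 0]
```

With the real model, the first row of `model.similarity(embeddings, embeddings)` plays the role of `scores` when the question is the first sentence.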
Training Details
Training Dataset
Unnamed Dataset
- Size: 1,340 training samples
- Columns: question and context
- Approximate statistics based on the first 1000 samples:
 | question | context |
---|---|---|
type | string | string |
details | min: 6 tokens; mean: 12.39 tokens; max: 24 tokens | min: 9 tokens; mean: 83.99 tokens; max: 510 tokens |
- Samples:
question | context |
---|---|
What is Basil commonly known as? | Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike. |
Where is Basil originally native to? | Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike. |
What is the significance of the Roerich Pact? | The Roerich Pact is a treaty on Protection of Artistic and Scientific Institutions and Historic Monuments, signed by the representatives of 21 states in the Oval Office of the White House on 15 April 1935. As of January 1, 1990, the Roerich Pact had been ratified by ten nations: Brazil, Chile, Colombia, Cuba, the Dominican Republic, El Salvador, Guatemala, Mexico, the United States, and Venezuela. It went into effect on 26 August 1935. The Government of India approved the Treaty in 1948, but did not take any further formal action. The Roerich Pact is also known as "Pax Cultura" ("Cultural Peace" or "Peace through Culture"). The most important part of the Roerich Pact is the legal recognition that the protection of culture is always more important than any military necessity. |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
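To make the loss parameters concrete, here is a minimal sketch of the in-batch negatives idea behind MultipleNegativesRankingLoss: each question is scored against every context in the batch with scaled cosine similarity, and cross-entropy rewards the matching context. It is written in plain Python for illustration and is not the library's implementation; `mnr_loss` and the toy vectors are assumptions.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and all other positives in the batch serve as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional "embeddings" for two question/context pairs (assumed values).
questions = [[1.0, 0.0], [0.0, 1.0]]
contexts_aligned = [[0.9, 0.1], [0.1, 0.9]]
contexts_swapped = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = mnr_loss(questions, contexts_aligned)
swapped_loss = mnr_loss(questions, contexts_swapped)
print(aligned_loss < swapped_loss)  # True
```

The scale of 20.0 sharpens the softmax over cosine scores, which otherwise lie in the narrow range from -1 to 1.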
Evaluation Dataset
Unnamed Dataset
- Size: 100 evaluation samples
- Columns: question and context
- Approximate statistics based on the first 100 samples:
 | question | context |
---|---|---|
type | string | string |
details | min: 8 tokens; mean: 12.2 tokens; max: 20 tokens | min: 22 tokens; mean: 85.76 tokens; max: 510 tokens |
- Samples:
question | context |
---|---|
What are the bases of political relations between India and Ireland? | Indo-Irish relations between the Republic of Ireland and the Republic of India picked up steam during the freedom struggles of the respective countries against a common imperial empire in the United Kingdom. Political relations between the two states have largely been based on socio-cultural ties, although political and economic ties have also helped build relations. Indians recognise Northern Ireland as part of its country. |
When did Rex Whitehead umpire his first Test match? | Rex Vernon Whitehead (26 October 1948 – 26 June 2014) was an Australian Test cricket match umpire and cricketer. He umpired four Test matches between 1981 and 1982. His first match was between Australia and India in Sydney on 2 January to 4 January 1981. Altogether, he umpired 15 first-class matches in his career between 1979 and 1983. |
What can you tell me about Nayaganj? | Nayaganj is a village in Vaishali District, Bihar, India. It is very close to the river Ganga. It is also a postal office of India |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 3e-06
- weight_decay: 0.03
- max_steps: 332
- warmup_ratio: 0.1
- warmup_steps: 1
- fp16: True
- batch_sampler: no_duplicates
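The learning-rate settings can be read as a schedule. The sketch below assumes the linear scheduler (the card's lr_scheduler_type value) and assumes that a non-zero warmup_steps takes precedence over warmup_ratio, which is how the Transformers Trainer is understood to resolve the two; `linear_schedule_lr` is a hypothetical helper written for illustration, not a library call.

```python
def linear_schedule_lr(step, base_lr=3e-06, warmup_steps=1, max_steps=332):
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

for step in (0, 1, 166, 332):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at 3e-06 after a single step and then decays to zero at step 332.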
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 3e-06
- weight_decay: 0.03
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3.0
- max_steps: 332
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 1
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | Validation Loss |
---|---|---|---|
0.1190 | 10 | 0.0998 | - |
0.2381 | 20 | 0.082 | 0.0253 |
0.3571 | 30 | 0.0843 | - |
0.4762 | 40 | 0.0496 | 0.0138 |
0.5952 | 50 | 0.0731 | - |
0.7143 | 60 | 0.0244 | 0.0093 |
0.8333 | 70 | 0.0338 | - |
0.9524 | 80 | 0.0484 | 0.0075 |
1.0714 | 90 | 0.0258 | - |
1.1905 | 100 | 0.0226 | 0.0067 |
1.3095 | 110 | 0.0331 | - |
1.4286 | 120 | 0.0193 | 0.0061 |
1.5476 | 130 | 0.0299 | - |
1.6667 | 140 | 0.0146 | 0.0055 |
1.7857 | 150 | 0.0228 | - |
1.9048 | 160 | 0.0543 | 0.0035 |
2.0238 | 170 | 0.0368 | - |
2.1429 | 180 | 0.025 | 0.0031 |
2.2619 | 190 | 0.0113 | - |
2.3810 | 200 | 0.0123 | 0.0029 |
2.5 | 210 | 0.0301 | - |
2.6190 | 220 | 0.0358 | 0.0027 |
2.7381 | 230 | 0.009 | - |
2.8571 | 240 | 0.01 | 0.0024 |
2.9762 | 250 | 0.0152 | - |
3.0952 | 260 | 0.013 | 0.0021 |
3.2143 | 270 | 0.0121 | - |
3.3333 | 280 | 0.012 | 0.0020 |
3.4524 | 290 | 0.0168 | - |
3.5714 | 300 | 0.0292 | 0.0019 |
3.6905 | 310 | 0.054 | - |
3.8095 | 320 | 0.0227 | 0.0019 |
3.9286 | 330 | 0.0144 | - |
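The Epoch column can be reproduced from the dataset size and batch size reported in this card. The following sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept, which matches the listed settings. It also shows why the log runs past 3 epochs: a positive max_steps overrides num_train_epochs.

```python
import math

train_samples = 1340  # training set size from this card
batch_size = 16       # per_device_train_batch_size
max_steps = 332       # overrides num_train_epochs when positive

# Number of optimizer steps needed to see every sample once.
steps_per_epoch = math.ceil(train_samples / batch_size)
print(steps_per_epoch)  # 84

# Epoch value logged at step 10, matching the first table row.
print(round(10 / steps_per_epoch, 4))  # 0.119

# Total number of epochs covered by max_steps.
print(round(max_steps / steps_per_epoch, 2))  # 3.95
```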
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.3.0
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}