metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:1340
  - loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-base-en-v1.5
widget:
  - source_sentence: Can you tell me about the origin of the word 'Shehnai'?
    sentences:
      - >-
        Krishan Kant (28 February 1927 – 27 July 2002) was the tenth Vice
        President of India from 1997 until his death. Previously, he was
        Governor of Andhra Pradesh from 1990 to 1997.
      - >-
        Acherontia lachesis is a large (up to 13 cm wingspan) Sphingid moth
        found in India and much of the Oriental region, one of the three species
        of Death's-head Hawkmoth, also known as the "Bee Robber".
      - >-
        A Shehnai is a South Asian music instrument which is normally played at
        marriages and other ceremonies, rites and rituals. The word itself is of
        Muslim/Turkish origin, combining 'Sheh' (or 'Shah') 'Royal' and '-Nai'
        or 'Ney', a type of Flute. A version of the "Shehnai", the "Surnai", is
        also played in the Northern and North-western areas of India and
        Pakistan, in particular at traditional Polo matches.
  - source_sentence: How do scammers typically operate in these scams?
    sentences:
      - >-
        The Singapore strategy was a strategy about defending the British Empire
        in the Asian Far East, mainly against the Empire of Japan. The strategy
        involved a number of different plans and stages, developed between 1919
        and 1941. The basic idea was to base a fleet of ships in the Far East.
        This fleet could then be used to stop and defeat a Japanese force
        heading towards India or Australia. In 1919, Singapore was chosen
        because of its strategic location at the end of the Strait of Malacca.
      - >-
        The Non-Cooperation Movement was a significant phase of the Indian
        independence movement from British rule. It was led by Mohandas
        Karamchand Gandhi after the Jallianwala Bagh Massacre. It aimed to
        resist British rule in India through non-violent means or "satyagraha".
        Protestors would refuse to buy British goods, adopt nihal use of local
        handicrafts and picket and liquor shops. The ideas of Ahimsa and
        nonviolence, and Gandhi's ability to rally hundreds of thousands of
        common citizens towards the cause of Indian independence, were first
        seen on a large scale in this movement through the summer 1920. Gandhi
        feared that the movement might lead to popular violence. The
        non-cooperation movement was launched on 12th August, 1921.
      - >-
        A technical support scam is a form of telephone fraud that tricks people
        by pretending that they are a service which helps people fix their
        computers. In most cases they convince the victim they have a computer
        problem that does not actually exist. A common type is when someone gets
        a call from someone (usually from places like India or Pakistan)
        pretending to be from a company that sounds real such as "Microsoft" or
        "Windows" support. Often the caller tries to gain the victim's trust.
        They may use confusing and very technical language to sound authentic.
        They may ask the victim to perform several tasks on their computer.
        Often they target legitimate files on the victim's computer saying these
        are viruses. These tactics are designed to scare people into letting the
        scammer fix the problem (that does not really exist). The caller may
        have the victim install malicious software that could capture sensitive
        data, such as online banking passwords or credit card information.
  - source_sentence: How is Northeast India connected to the rest of India?
    sentences:
      - >-
        Air India Flight 182 was a passenger plane which, on June 23, 1985,
        exploded from a bomb that was placed on the plane. The aircraft was
        going between Montréal-Mirabel International Airport, Montreal, Quebec,
        and New Delhi, India. It was an Air India Boeing 747-237B, registration
        VT-EFO. The bombing was called the largest mass murder in modern
        Canadian history, and the deadliest act of air terrorism before 9/11.
      - >-
        Hinduism is not only a religion but also a way of life. Hinduism is
        widely practiced in South Asia mainly in India and Nepal. Hinduism is
        the oldest religion in the world, and Hindus refer to it as "", "the
        eternal tradition," or the "eternal way," beyond human history. Scholars
        regard Hinduism as a combination of different Indian cultures and
        traditions, with diverse roots. Hinduism has no founder and origins of
        Hinduism is unknown. What we now call Hinduism have roots in cave
        paintings that have been preserved from Mesolithic sites dating from c.
        30,000 BCE in Bhimbetka, near present-day Bhopal, in the Vindhya
        Mountains in the Madhya Pradesh." There was no concept of religion in
        India and Hinduism was not a religion. Hinduism as a religion started to
        develop between 500 BCE and 300 CE, after the Vedic period (1500 BCE to
        500 BCE).
      - >-
        Various groups are involved in the Insurgency in Northeast India,
        India's northeast states, which are connected to the rest of the
        Republic of India by a narrow strip of land known as the Siliguri
        Corridor. In the region several armed factions operate. Some groups call
        for a separate state, others for regional autonomy, while some extreme
        groups demand complete independence.
  - source_sentence: How many songs did Rafi sing during his career?
    sentences:
      - "Inder Kumar Gujral (4 December 1919\_– 30 November 2012) was an Indian politician. He was the 12th Prime Minister of India from April 1997 to March 1998. Gujral was the third Prime Minister to be from the Rajya Sabha."
      - >-
        Mohammed Rafi (, , December 24, 1924 – July 31, 1980) was a popular
        Bollywood playback singer. In a career of over 40 years, Rafi sang more
        than 26,000 songs in the national languages of India and sometimes in
        other languages.
      - >-
        The University of Calcutta (informally known as Calcutta University or
        CU) is a public state university located in Kolkata (formerly
        "Calcutta"), West Bengal, India. It was created on 24 January 1857.
        Within India it is recognized as a "Five-Star University" and a "Centre
        with Potential for Excellence" by the University Grants Commission and
        the National Assessment and Accreditation Council.
  - source_sentence: Who was Lal Bahadur Shastri?
    sentences:
      - >-
        The Bharatiya Janata Party (abbreviated BJP) is one of the two major
        political parties in India. (The second being the Indian National
        Congress). Since the Indian elections in 2014, the BJP has 303 of the
        542 seats in the Lok Sabha, the lower house of the Parliament of India
        and 78 of the 238 seats in Rajya Sabha, the upper house of the
        Parliament of India. Amit Shah is the national president of BJP since
        2014.
      - >-
        Rex Vernon Whitehead (26 October 1948 – 26 June 2014) was an Australian
        Test cricket match umpire and cricketer. He umpired four Test matches
        between 1981 and 1982. His first match was between Australia and India
        in Sydney on 2 January to 4 January 1981. Altogether, he umpired 15
        first-class matches in his career between 1979 and 1983.
      - "Lal Bahadur Shastri (, , 2 October 1904\_– 11 January 1966) was an Indian politician. He was the 2nd Prime Minister of India from 1964 to 1966. He was a senior leader of the Indian National Congress political party."
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
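
The last two modules in the architecture above (CLS pooling followed by Normalize) can be illustrated with a small sketch. This is plain Python on toy numbers, not the library's implementation; the function name `cls_pool_and_normalize` and the 4-dimensional vectors are assumptions made for the example (the real model produces 768-dimensional vectors).

```python
import math

def cls_pool_and_normalize(token_embeddings):
    """Keep the first ([CLS]) token vector and scale it to unit L2 norm."""
    cls_vector = token_embeddings[0]  # corresponds to pooling_mode_cls_token: True
    norm = math.sqrt(sum(x * x for x in cls_vector))
    return [x / norm for x in cls_vector]  # corresponds to Normalize()

# Toy token embeddings: 3 tokens, 4 dimensions each (hypothetical values).
tokens = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.2, 0.1, 0.9],
]
sentence_embedding = cls_pool_and_normalize(tokens)
print(sentence_embedding)
```

Because every output vector has unit length, cosine similarity between two embeddings reduces to a plain dot product.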

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sunjupskilling/sunj-bge-base-en-v1.5")
# Run inference
sentences = [
    'Who was Lal Bahadur Shastri?',
    'Lal Bahadur Shastri (, , 2 October 1904\xa0– 11 January 1966) was an Indian politician. He was the 2nd Prime Minister of India from 1964 to 1966. He was a senior leader of the Indian National Congress political party.',
    'Rex Vernon Whitehead (26 October 1948 – 26 June 2014) was an Australian Test cricket match umpire and cricketer. He umpired four Test matches between 1981 and 1982. His first match was between Australia and India in Sydney on 2 January to 4 January 1981. Altogether, he umpired 15 first-class matches in his career between 1979 and 1983.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
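
For retrieval-style use, the similarity scores are typically used to rank candidate contexts against a question. The sketch below shows that ranking step in plain Python. The 2-dimensional vectors are hypothetical stand-ins for the output of `model.encode(...)`, and `rank_by_similarity` is a helper name invented for this example.

```python
def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending dot product.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Hypothetical unit-length vectors standing in for real embeddings.
query = [0.6, 0.8]
docs = [
    [0.8, 0.6],   # fairly close to the query
    [-0.6, 0.8],  # less similar
    [0.6, 0.8],   # same direction as the query
]
ranking = rank_by_similarity(query, docs)
print(ranking)
```

With the real model, the first embedding (the question) would play the role of `query` and the remaining embeddings the role of `docs`.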

Training Details

Training Dataset

Unnamed Dataset

  • Size: 1,340 training samples
  • Columns: question and context
  • Approximate statistics based on the first 1000 samples:
               question        context
    type       string          string
    min        6 tokens        9 tokens
    mean       12.39 tokens    83.99 tokens
    max        24 tokens       510 tokens
  • Samples:
    question: What is Basil commonly known as?
    context: Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.

    question: Where is Basil originally native to?
    context: Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.

    question: What is the significance of the Roerich Pact?
    context: The Roerich Pact is a treaty on Protection of Artistic and Scientific Institutions and Historic Monuments, signed by the representatives of 21 states in the Oval Office of the White House on 15 April 1935. As of January 1, 1990, the Roerich Pact had been ratified by ten nations: Brazil, Chile, Colombia, Cuba, the Dominican Republic, El Salvador, Guatemala, Mexico, the United States, and Venezuela. It went into effect on 26 August 1935. The Government of India approved the Treaty in 1948, but did not take any further formal action. The Roerich Pact is also known as "Pax Cultura" ("Cultural Peace" or "Peace through Culture"). The most important part of the Roerich Pact is the legal recognition that the protection of culture is always more important than any military necessity.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
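
To make the loss settings above concrete, here is a minimal sketch of how a multiple-negatives ranking loss can be computed for one batch: each question is scored against every context in the batch using scaled cosine similarity, and cross-entropy pushes the matching context (same index) to the top. This is an illustration in plain Python, not the sentence-transformers implementation; the function names and toy vectors are assumptions.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    All other positives in the batch act as in-batch negatives.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical 2-dimensional embeddings for a batch of two pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
well_aligned = [[0.9, 0.1], [0.1, 0.9]]  # each context matches its question
swapped = [[0.1, 0.9], [0.9, 0.1]]       # contexts paired with the wrong question

loss_good = mnr_loss(questions, well_aligned)
loss_bad = mnr_loss(questions, swapped)
print(loss_good, loss_bad)
```

The loss is close to zero when each question is most similar to its own context and grows when another context in the batch scores higher.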
    

Evaluation Dataset

Unnamed Dataset

  • Size: 100 evaluation samples
  • Columns: question and context
  • Approximate statistics based on the first 100 samples:
               question        context
    type       string          string
    min        8 tokens        22 tokens
    mean       12.2 tokens     85.76 tokens
    max        20 tokens       510 tokens
  • Samples:
    question: What are the bases of political relations between India and Ireland?
    context: Indo-Irish relations between the Republic of Ireland and the Republic of India picked up steam during the freedom struggles of the respective countries against a common imperial empire in the United Kingdom. Political relations between the two states have largely been based on socio-cultural ties, although political and economic ties have also helped build relations. Indians recognise Northern Ireland as part of its country.

    question: When did Rex Whitehead umpire his first Test match?
    context: Rex Vernon Whitehead (26 October 1948 – 26 June 2014) was an Australian Test cricket match umpire and cricketer. He umpired four Test matches between 1981 and 1982. His first match was between Australia and India in Sydney on 2 January to 4 January 1981. Altogether, he umpired 15 first-class matches in his career between 1979 and 1983.

    question: What can you tell me about Nayaganj?
    context: Nayaganj is a village in Vaishali District, Bihar, India. It is very close to the river Ganga. It is also a postal office of India
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 3e-06
  • weight_decay: 0.03
  • max_steps: 332
  • warmup_ratio: 0.1
  • warmup_steps: 1
  • fp16: True
  • batch_sampler: no_duplicates
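
The learning-rate settings above interact: in the Transformers trainer, a non-zero `warmup_steps` is documented as overriding `warmup_ratio`, so this run most likely used a single warmup step followed by linear decay. The sketch below reproduces that shape in plain Python under this assumption; `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=3e-06, warmup_steps=1, total_steps=332):
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 166, 332):
    print(step, linear_lr(step))
```

Under these assumptions the rate reaches its 3e-06 peak after the first step and falls to zero at step 332.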

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-06
  • weight_decay: 0.03
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3.0
  • max_steps: 332
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 1
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step  Training Loss  Validation Loss
0.1190  10    0.0998         -
0.2381  20    0.082          0.0253
0.3571  30    0.0843         -
0.4762  40    0.0496         0.0138
0.5952  50    0.0731         -
0.7143  60    0.0244         0.0093
0.8333  70    0.0338         -
0.9524  80    0.0484         0.0075
1.0714  90    0.0258         -
1.1905  100   0.0226         0.0067
1.3095  110   0.0331         -
1.4286  120   0.0193         0.0061
1.5476  130   0.0299         -
1.6667  140   0.0146         0.0055
1.7857  150   0.0228         -
1.9048  160   0.0543         0.0035
2.0238  170   0.0368         -
2.1429  180   0.025          0.0031
2.2619  190   0.0113         -
2.3810  200   0.0123         0.0029
2.5     210   0.0301         -
2.6190  220   0.0358         0.0027
2.7381  230   0.009          -
2.8571  240   0.01           0.0024
2.9762  250   0.0152         -
3.0952  260   0.013          0.0021
3.2143  270   0.0121         -
3.3333  280   0.012          0.0020
3.4524  290   0.0168         -
3.5714  300   0.0292         0.0019
3.6905  310   0.054          -
3.8095  320   0.0227         0.0019
3.9286  330   0.0144         -
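
The Epoch column in the logs can be reproduced from the dataset size and batch size reported earlier. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch is kept (`dataloader_drop_last: False`); the variable and function names are chosen for this example.

```python
import math

train_samples = 1340  # training set size reported above
batch_size = 16       # per_device_train_batch_size
max_steps = 332       # stops training regardless of num_train_epochs

# 1340 / 16 = 83.75, so one epoch takes 84 optimizer steps.
steps_per_epoch = math.ceil(train_samples / batch_size)

def epoch_at(step):
    """Fractional epoch reached after a given number of optimizer steps."""
    return step / steps_per_epoch

for step in (10, 20, 330):
    print(step, round(epoch_at(step), 4))

total_epochs = max_steps / steps_per_epoch
print(round(total_epochs, 2))
```

This also explains why the logs run past epoch 3 even though `num_train_epochs` is 3.0: `max_steps` of 332 takes precedence and corresponds to just under four passes over the data.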

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.3.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}