silma-ai/silma-embedding-sts-v0.1

@@ -18,34 +18,7 @@ tags:
 - sentence-similarity
 - feature-extraction
 - generated_from_trainer
-- dataset_size:34436
 - loss:CosineSimilarityLoss
-widget:
-- source_sentence: Three men are playing chess.
-  sentences:
-  - Two men are fighting.
-  - امرأة تحمل و تحمل طفل كنغر
-  - Two men are playing chess.
-- source_sentence: Two men are playing chess.
-  sentences:
-  - رجل يعزف على الغيتار و يغني
-  - Three men are playing chess.
-  - طائرة طيران تقلع
-- source_sentence: Two men are playing chess.
-  sentences:
-  - A man is playing a large flute. رجل يعزف على ناي كبير
-  - The man is playing the piano. الرجل يعزف على البيانو
-  - Three men are playing chess.
-- source_sentence: الرجل يعزف على البيانو The man is playing the piano.
-  sentences:
-  - رجل يجلس ويلعب الكمان A man seated is playing the cello.
-  - ثلاثة رجال يلعبون الشطرنج.
-  - الرجل يعزف على الغيتار The man is playing the guitar.
-- source_sentence: الرجل ضرب الرجل الآخر بعصا The man hit the other man with a stick.
-  sentences:
-  - الرجل صفع الرجل الآخر بعصا The man spanked the other man with a stick.
-  - A plane is taking off.
-  - A man is smoking. رجل يدخن
 model-index:
 - name: SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
   results:
@@ -123,6 +96,10 @@ model-index:
     - type: spearman_max
       value: 0.8530609768738506
       name: Spearman Max
 ---
 # SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
@@ -133,28 +110,10 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [s
 ### Model Description
 - **Model Type:** Sentence Transformer
-- **Base model:** [silma-ai/silma-embeddding-matryoshka-0.1](https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1) <!-- at revision 9eb50734f432656a01e1f88d28fa9a6fe8b9e148 -->
 - **Maximum Sequence Length:** 512 tokens
 - **Output Dimensionality:** 768 dimensions
 - **Similarity Function:** Cosine Similarity
-<!-- - **Training Dataset:** Unknown -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
-### Model Sources
-- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
-- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
-- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-### Full Model Architecture
-```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-)
-```
 ## Usage
@@ -166,26 +125,160 @@ First install the Sentence Transformers library:
 pip install -U sentence-transformers
 ```
-Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
-# Download from the 🤗 Hub
 model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")
-# Run inference
-sentences = [
-    'الرجل ضرب الرجل الآخر بعصا The man hit the other man with a stick.',
-    'الرجل صفع الرجل الآخر بعصا The man spanked the other man with a stick.',
-    'A man is smoking. رجل يدخن',
-]
-embeddings = model.encode(sentences)
-print(embeddings.shape)
-# [3, 768]
-# Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities.shape)
-# [3, 3]
 ```
 <!--
@@ -264,188 +357,52 @@ You can finetune this model on your own dataset.
 ## Training Details
-### Training Dataset
-#### Unnamed Dataset
-* Size: 34,436 training samples
-* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
-* Approximate statistics based on the first 1000 samples:
-  |         | sentence1                                                                         | sentence2                                                                         | score                                                          |
-  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:---------------------------------------------------------------|
-  | type    | string                                                                            | string                                                                            | float                                                          |
-  | details | <ul><li>min: 4 tokens</li><li>mean: 15.18 tokens</li><li>max: 42 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 15.18 tokens</li><li>max: 42 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.54</li><li>max: 1.0</li></ul> |
-* Samples:
-  | sentence1                                                                                          | sentence2                                                                                          | score             |
-  |:---------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------|:------------------|
-  | <code>A woman picks up and holds a baby kangaroo in her arms. امرأة تحمل في ذراعها طفل كنغر</code> | <code>A woman picks up and holds a baby kangaroo. امرأة تحمل و تحمل طفل كنغر</code>                | <code>0.92</code> |
-  | <code>امرأة تحمل و تحمل طفل كنغر A woman picks up and holds a baby kangaroo.</code>                | <code>امرأة تحمل في ذراعها طفل كنغر A woman picks up and holds a baby kangaroo in her arms.</code> | <code>0.92</code> |
-  | <code>رجل يعزف على الناي</code>                                                                    | <code>رجل يعزف على فرقة الخيزران</code>                                                            | <code>0.77</code> |
-* Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
-  ```json
-  {
-      "loss_fct": "torch.nn.modules.loss.MSELoss"
-  }
-  ```
-### Evaluation Dataset
-#### Unnamed Dataset
-* Size: 100 evaluation samples
-* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
-* Approximate statistics based on the first 100 samples:
-  |         | sentence1                                                                         | sentence2                                                                         | score                                                          |
-  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:---------------------------------------------------------------|
-  | type    | string                                                                            | string                                                                            | float                                                          |
-  | details | <ul><li>min: 4 tokens</li><li>mean: 15.96 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 15.96 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>min: 0.1</li><li>mean: 0.72</li><li>max: 1.0</li></ul> |
-* Samples:
-  | sentence1                           | sentence2                                | score            |
-  |:------------------------------------|:-----------------------------------------|:-----------------|
-  | <code>طائرة ستقلع</code>            | <code>طائرة طيران تقلع</code>            | <code>1.0</code> |
-  | <code>طائرة طيران تقلع</code>       | <code>طائرة ستقلع</code>                 | <code>1.0</code> |
-  | <code>A plane is taking off.</code> | <code>An air plane is taking off.</code> | <code>1.0</code> |
-* Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
-  ```json
-  {
-      "loss_fct": "torch.nn.modules.loss.MSELoss"
-  }
-  ```
-### Training Hyperparameters
-#### Non-Default Hyperparameters
-- `eval_strategy`: steps
 - `per_device_train_batch_size`: 250
 - `per_device_eval_batch_size`: 10
-- `learning_rate`: 1e-06
-- `num_train_epochs`: 10
 - `bf16`: True
 - `dataloader_drop_last`: True
 - `optim`: adamw_torch_fused
 - `batch_sampler`: no_duplicates
-#### All Hyperparameters
-<details><summary>Click to expand</summary>
-- `overwrite_output_dir`: False
-- `do_predict`: False
 - `eval_strategy`: steps
-- `prediction_loss_only`: True
 - `per_device_train_batch_size`: 250
 - `per_device_eval_batch_size`: 10
-- `per_gpu_train_batch_size`: None
-- `per_gpu_eval_batch_size`: None
-- `gradient_accumulation_steps`: 1
-- `eval_accumulation_steps`: None
-- `torch_empty_cache_steps`: None
 - `learning_rate`: 1e-06
-- `weight_decay`: 0.0
-- `adam_beta1`: 0.9
-- `adam_beta2`: 0.999
-- `adam_epsilon`: 1e-08
-- `max_grad_norm`: 1.0
 - `num_train_epochs`: 10
-- `max_steps`: -1
-- `lr_scheduler_type`: linear
-- `lr_scheduler_kwargs`: {}
-- `warmup_ratio`: 0.0
-- `warmup_steps`: 0
-- `log_level`: passive
-- `log_level_replica`: warning
-- `log_on_each_node`: True
-- `logging_nan_inf_filter`: True
-- `save_safetensors`: True
-- `save_on_each_node`: False
-- `save_only_model`: False
-- `restore_callback_states_from_checkpoint`: False
-- `no_cuda`: False
-- `use_cpu`: False
-- `use_mps_device`: False
-- `seed`: 42
-- `data_seed`: None
-- `jit_mode_eval`: False
-- `use_ipex`: False
 - `bf16`: True
-- `fp16`: False
-- `fp16_opt_level`: O1
-- `half_precision_backend`: auto
-- `bf16_full_eval`: False
-- `fp16_full_eval`: False
-- `tf32`: None
-- `local_rank`: 0
-- `ddp_backend`: None
-- `tpu_num_cores`: None
-- `tpu_metrics_debug`: False
-- `debug`: []
 - `dataloader_drop_last`: True
-- `dataloader_num_workers`: 0
-- `dataloader_prefetch_factor`: None
-- `past_index`: -1
-- `disable_tqdm`: False
-- `remove_unused_columns`: True
-- `label_names`: None
-- `load_best_model_at_end`: False
-- `ignore_data_skip`: False
-- `fsdp`: []
-- `fsdp_min_num_params`: 0
-- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
-- `fsdp_transformer_layer_cls_to_wrap`: None
-- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
-- `deepspeed`: None
-- `label_smoothing_factor`: 0.0
 - `optim`: adamw_torch_fused
-- `optim_args`: None
-- `adafactor`: False
-- `group_by_length`: False
-- `length_column_name`: length
-- `ddp_find_unused_parameters`: None
-- `ddp_bucket_cap_mb`: None
-- `ddp_broadcast_buffers`: False
-- `dataloader_pin_memory`: True
-- `dataloader_persistent_workers`: False
-- `skip_memory_metrics`: True
-- `use_legacy_prediction_loop`: False
-- `push_to_hub`: False
-- `resume_from_checkpoint`: None
-- `hub_model_id`: None
-- `hub_strategy`: every_save
-- `hub_private_repo`: False
-- `hub_always_push`: False
-- `gradient_checkpointing`: False
-- `gradient_checkpointing_kwargs`: None
-- `include_inputs_for_metrics`: False
-- `eval_do_concat_batches`: True
-- `fp16_backend`: auto
-- `push_to_hub_model_id`: None
-- `push_to_hub_organization`: None
-- `mp_parameters`:
-- `auto_find_batch_size`: False
-- `full_determinism`: False
-- `torchdynamo`: None
-- `ray_scope`: last
-- `ddp_timeout`: 1800
-- `torch_compile`: False
-- `torch_compile_backend`: None
-- `torch_compile_mode`: None
-- `dispatch_batches`: None
-- `split_batches`: None
-- `include_tokens_per_second`: False
-- `include_num_input_tokens_seen`: False
-- `neftune_noise_alpha`: None
-- `optim_target_modules`: None
-- `batch_eval_metrics`: False
-- `eval_on_start`: False
-- `use_liger_kernel`: False
-- `eval_use_gather_object`: False
 - `batch_sampler`: no_duplicates
-- `multi_dataset_batch_sampler`: proportional
 </details>
-### Training Logs
 | Epoch  | Step | Training Loss | Validation Loss | sts-dev-512_spearman_cosine | sts-dev-256_spearman_cosine |
 |:------:|:----:|:-------------:|:---------------:|:---------------------------:|:---------------------------:|
 | 0.3650 | 50   | 0.0395        | 0.0424          | 0.8486                      | 0.8487                      |

 - sentence-similarity
 - feature-extraction
 - generated_from_trainer
 - loss:CosineSimilarityLoss
 model-index:
 - name: SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
   results:
     - type: spearman_max
       value: 0.8530609768738506
       name: Spearman Max
+license: apache-2.0
+language:
+- ar
+- en
 ---
 # SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
 ### Model Description
 - **Model Type:** Sentence Transformer
+- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02)
 - **Maximum Sequence Length:** 512 tokens
 - **Output Dimensionality:** 768 dimensions
 - **Similarity Function:** Cosine Similarity
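
As a reminder of what the similarity function computes, here is a minimal standard-library sketch of cosine similarity. The helper name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions; real embeddings from this model have 768 components.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (assumption, not model output).
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```

The score depends only on direction, not on vector length, which is why the first pair scores 1.0 even though the magnitudes differ.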
 ## Usage
 pip install -U sentence-transformers
 ```
+Then load the model:
 ```python
 from sentence_transformers import SentenceTransformer
+from sentence_transformers.util import cos_sim
 model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")
+```
+### Samples
+#### [+] Short Sentence Similarity
+**Arabic**
+```python
+query = "الطقس اليوم مشمس"
+sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
+sentence_2 = "الطقس اليوم غائم"
+query_embedding = model.encode(query)
+print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
+print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())
+# ======= Output
+# sentence_1_similarity: 0.42602288722991943
+# sentence_2_similarity: 0.10798501968383789
+# =======
+```
+**English**
+```python
+query = "The weather is sunny today"
+sentence_1 = "The morning was bright and sunny"
+sentence_2 = "it is too cloudy today"
+query_embedding = model.encode(query)
+print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
+print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())
+# ======= Output
+# sentence_1_similarity: 0.5796191692352295
+# sentence_2_similarity: 0.21948376297950745
+# =======
+```
+#### [+] Long Sentence Similarity
+**Arabic**
+```python
+query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
+sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
+sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"
+query_embedding = model.encode(query)
+print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
+print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())
+# ======= Output
+# sentence_1_similarity: 0.5725120306015015
+# sentence_2_similarity: 0.22617210447788239
+# =======
+```
+**English**
+```python
+query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks"
+sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector."
+sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks."
+query_embedding = model.encode(query)
+print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
+print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())
+# ======= Output
+# sentence_1_similarity: 0.6438770294189453
+# sentence_2_similarity: 0.4720292389392853
+# =======
+```
+#### [+] Question to Paragraph Matching
+**Arabic**
+```python
+query = "ما هي فوائد ممارسة الرياضة؟"
+sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
+sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"
+query_embedding = model.encode(query)
+print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
+print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())
+# ======= Output
+# sentence_1_similarity: 0.6058318614959717
+# sentence_2_similarity: 0.006831036880612373
+# =======
+```
+**English**
+```python
+query = "What are the benefits of exercising?"
+sentence_1 = "Regular exercise helps improve overall health and physical fitness"
+sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly"
+query_embedding = model.encode(query)
+print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
+print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())
+# ======= Output
+# sentence_1_similarity: 0.3593001365661621
+# sentence_2_similarity: 0.06493218243122101
+# =======
+```
+#### [+] Message to Intent-Name Mapping
+**Arabic**
+```python
+query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
+sentence_1 = "حجز رحلة"
+sentence_2 = "إلغاء حجز"
+query_embedding = model.encode(query)
+print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
+print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())
+# ======= Output
+# sentence_1_similarity: 0.4646468162536621
+# sentence_2_similarity: 0.19563665986061096
+# =======
+```
+**English**
+```python
+query = "Please send and email to all of the managers"
+sentence_1 = "send email"
+sentence_2 = "read inbox emails"
+query_embedding = model.encode(query)
+print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
+print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())
+# ======= Output
+# sentence_1_similarity: 0.6096147298812866
+# sentence_2_similarity: 0.42170101404190063
+# =======
 ```
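
The samples above score each candidate one at a time. When there are many candidates (for example, a list of intent names), it can help to wrap the comparison in a small ranking helper. The sketch below is one possible shape for such a helper, not part of the library: `rank_candidates` and `toy_encode` are hypothetical names, and `toy_encode` is a keyword-counting stand-in so the example runs without downloading the model. In real use you would pass `model.encode` as the `encode` argument.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_candidates(query, candidates, encode):
    """Return (candidate, score) pairs sorted by descending similarity to the query.

    `encode` is any callable that maps a string to a vector, e.g. `model.encode`.
    """
    query_vec = encode(query)
    scored = [(text, cosine(query_vec, encode(text))) for text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def toy_encode(text):
    """Hypothetical offline encoder: smoothed counts of four keywords."""
    words = text.lower().split()
    return [words.count(w) + 0.01 for w in ("send", "read", "email", "inbox")]

ranked = rank_candidates(
    "Please send and email to all of the managers",
    ["read inbox emails", "send email"],
    toy_encode,
)
print(ranked[0][0])  # send email
```

With the real model, the scores would come from the learned embeddings rather than keyword counts, so the numbers will differ from this toy run.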
 <!--
 ## Training Details
+This model was fine-tuned in 2 phases:
+### Phase 1:
+In phase `1`, we curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which
+contains more than `2.25M` records of (anchor, positive, negative) Arabic/English samples.
+The first `600` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.
+Phase `1` produces a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:
 - `per_device_train_batch_size`: 250
 - `per_device_eval_batch_size`: 10
+- `learning_rate`: 1e-05
+- `num_train_epochs`: 3
 - `bf16`: True
 - `dataloader_drop_last`: True
 - `optim`: adamw_torch_fused
 - `batch_sampler`: no_duplicates
+**[training-example](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**
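
A Matryoshka model is trained so that a leading slice of each embedding remains usable on its own. The following is a minimal sketch of how such a slice is typically taken: keep the first `dim` components, then rescale to unit length. The function name `truncate_and_normalize` and the toy 4-dimensional vector are assumptions for illustration; the dimensions actually used in training are not stated here beyond the `sts-dev-512` and `sts-dev-256` column names in the training log.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 4-dimensional vector standing in for a 768-dimensional embedding (assumption).
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```

Sentence Transformers also accepts a `truncate_dim` argument when constructing `SentenceTransformer`, which applies the truncation at encode time; whether it suits this checkpoint is something to verify on your own data.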
+### Phase 2:
+In phase `2`, we curated the dataset [silma-ai/silma-arabic-english-sts-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-english-sts-dataset-v1.0), which
+contains more than `30k` records of (sentence1, sentence2, similarity-score) Arabic/English samples.
+The first `100` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.
+Phase `2` produces a fine-tuned `STS` model based on the model from phase `1`, with the following hyperparameters:
 - `eval_strategy`: steps
 - `per_device_train_batch_size`: 250
 - `per_device_eval_batch_size`: 10
 - `learning_rate`: 1e-06
 - `num_train_epochs`: 10
 - `bf16`: True
 - `dataloader_drop_last`: True
 - `optim`: adamw_torch_fused
 - `batch_sampler`: no_duplicates
+**[training-example](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py)**
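
The card is tagged `loss:CosineSimilarityLoss`, and the earlier auto-generated card listed `MSELoss` as its `loss_fct`. Under that assumption, the Phase 2 objective is the mean squared error between the cosine similarity of each sentence pair and its gold score. The sketch below works on plain lists with toy 2-dimensional vectors; it illustrates the formula only and is not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between cos(u, v) and the gold score, over a batch."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)

# Two toy pairs: identical vectors (cos = 1) and orthogonal vectors (cos = 0).
batch = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(cosine_similarity_loss(batch, [1.0, 0.5]))  # 0.125
```

The first pair matches its gold score exactly and contributes zero; the second misses by 0.5, so the batch loss is 0.25 / 2.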
 </details>
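
The batch size listed above fixes how many optimizer steps make up one epoch, which in turn explains the fractional epoch values in the Phase 2 training log. The arithmetic below is a sketch that assumes the 34,436 training samples reported by the earlier auto-generated card, a single device, and no gradient accumulation.

```python
train_samples = 34_436  # training-set size from the earlier auto-generated card (assumption)
batch_size = 250        # per_device_train_batch_size

# dataloader_drop_last=True discards the final partial batch, hence floor division.
steps_per_epoch = train_samples // batch_size
print(steps_per_epoch)                 # 137

# Fraction of an epoch completed at step 50, the first logged row.
print(round(50 / steps_per_epoch, 4))  # 0.365
```

The result agrees with the `0.3650` epoch value logged at step 50.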
+### Training Logs (Phase 2)
 | Epoch  | Step | Training Loss | Validation Loss | sts-dev-512_spearman_cosine | sts-dev-256_spearman_cosine |
 |:------:|:----:|:-------------:|:---------------:|:---------------------------:|:---------------------------:|
 | 0.3650 | 50   | 0.0395        | 0.0424          | 0.8486                      | 0.8487                      |