---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:350
  - loss:MultipleNegativesRankingLoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
  - source_sentence: >-
      Data pengeluaran bulanan rumah tangga pedesaan untuk konsumsi makanan dan
      non-makanan per provinsi, tahun berapa saja tersedia?
    sentences:
      - Sistem Neraca Sosial Ekonomi Indonesia Tahun 2022 (84 x 84)
      - >-
        Persentase RataRata Pengeluaran per Kapita Sebulan Untuk Makanan dan
        Bukan Makanan di Daerah Perdesaan Menurut Provinsi, 2007-2024
      - >-
        Nilai Impor Jawa Madura Menurut Pelabuhan Impor di Pulau Jawa Madura
        Tahun 2009 - 2013 (Juta US $) 1)
  - source_sentence: Asal impor gula Indonesia periode 2017 hingga 2023
    sentences:
      - >-
        Banyaknya Anggota Kadinda Menurut Kabupaten/Kota di Provinsi Jawa
        Tengah, 2019
      - Impor Gula menurut Negara Asal Utama, 2017-2023
      - >-
        Rata-rata Pendapatan Bersih Pekerja Bebas Menurut Provinsi dan Kelompok
        Umur, 2023
  - source_sentence: Laju kehilangan hutan Indonesia dalam dan luar kawasan hutan 2013-2022.
    sentences:
      - >-
        Institusi Pemerintah Neraca Institusi Terintegrasi (Triliun Rupiah),
        2016 2023
      - >-
        Angka Deforestasi (Netto) Indonesia di Dalam dan di Luar Kawasan Hutan
        Tahun 2013-2022 (Ha/Th)
      - >-
        Produksi Perkebunan Menurut Kabupaten/Kota dan Jenis Tanaman di Provinsi
        Jawa Tengah (ton), 2021 dan 2022
  - source_sentence: Kemana saja lada putih Indonesia diekspor pada periode 2012 sampai 2023?
    sentences:
      - >-
        Rata-rata Pendapatan Bersih Pekerja Bebas Menurut Provinsi dan Kelompok
        Umur, 2022-2023
      - Ekspor Lada Putih menurut Negara Tujuan Utama, 2012-2023
      - >-
        Angka Kelahiran Kasar (Crude Birth Rate) Hasil Long Form SP2020 Menurut
        Provinsi/Kabupaten/Kota, 2020
  - source_sentence: >-
      data gaji bersih pegawai per bulan tahun 2023 berdasarkan pendidikan dan
      jenis pekerjaan utama
    sentences:
      - >-
        Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut
        Pendidikan Tertinggi yang Ditamatkan dan Jenis Pekerjaan Utama, 2023
      - >-
        Banyaknya Kunjungan Kapal Melalui Pelabuhan Jepara Menurut Jenis
        Pelayaran Tahun 2009 - 2013
      - Ekspor Sarang Burung menurut Negara Tujuan Utama, 2012-2023
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: bps val mfd all
          type: bps-val-mfd-all
        metrics:
          - type: cosine_accuracy@1
            value: 0.9861111111111112
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.9861111111111112
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.9861111111111112
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.9861111111111112
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.9861111111111112
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.9351851851851851
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.9055555555555554
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.8333333333333334
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.016151592322246593
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.0425075387306992
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.06836160354671791
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.11202747994449548
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.8706665539282586
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.9861111111111112
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.44673547368787836
            name: Cosine Map@100
---

# SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the csv dataset (350 training triplets of Indonesian queries and BPS statistics table titles; see Training Details). It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: csv (350 samples)

### Model Sources

  • Documentation: [Sentence Transformers Documentation](https://sbert.net)
  • Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
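These properties can be confirmed directly from a loaded model; a minimal sketch (the model id below is a placeholder for this model's Hub repo id or a local path):

```python
from sentence_transformers import SentenceTransformer

# Placeholder id: substitute this model's actual Hub repo id or a local path.
model = SentenceTransformer("sentence_transformers_model_id")

print(model.max_seq_length)                      # 128: longer inputs are truncated
print(model.get_sentence_embedding_dimension())  # 384: size of the mean-pooled output vector
print(model)                                     # prints the Transformer + Pooling modules shown above
```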

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub (replace the placeholder with this model's repo id or a local path)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'data gaji bersih pegawai per bulan tahun 2023 berdasarkan pendidikan dan jenis pekerjaan utama',
    'Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut Pendidikan Tertinggi yang Ditamatkan dan Jenis Pekerjaan Utama, 2023',
    'Banyaknya Kunjungan Kapal Melalui Pelabuhan Jepara Menurut Jenis Pelayaran Tahun 2009 - 2013',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
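Since the model was trained to match Indonesian search queries against BPS table titles, a common pattern is semantic search over a corpus of titles. A minimal sketch reusing the widget examples above (the model id is again a placeholder):

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder id: substitute this model's actual Hub repo id or a local path.
model = SentenceTransformer("sentence_transformers_model_id")

# A toy corpus of BPS-style table titles, taken from the examples above.
corpus = [
    "Impor Gula menurut Negara Asal Utama, 2017-2023",
    "Ekspor Lada Putih menurut Negara Tujuan Utama, 2012-2023",
    "Distribusi Pembagian Pengeluaran per Kapita dan Indeks Gini, 2010-2024",
]
query = "Asal impor gula Indonesia periode 2017 hingga 2023"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity and keep the top 2 hits.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```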

## Evaluation

### Metrics

#### Information Retrieval

  • Dataset: bps-val-mfd-all

| Metric              | Value  |
|---------------------|--------|
| cosine_accuracy@1   | 0.9861 |
| cosine_accuracy@3   | 0.9861 |
| cosine_accuracy@5   | 0.9861 |
| cosine_accuracy@10  | 0.9861 |
| cosine_precision@1  | 0.9861 |
| cosine_precision@3  | 0.9352 |
| cosine_precision@5  | 0.9056 |
| cosine_precision@10 | 0.8333 |
| cosine_recall@1     | 0.0162 |
| cosine_recall@3     | 0.0425 |
| cosine_recall@5     | 0.0684 |
| cosine_recall@10    | 0.112  |
| cosine_ndcg@10      | 0.8707 |
| cosine_mrr@10       | 0.9861 |
| cosine_map@100      | 0.4467 |
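These metric names match the output of sentence-transformers' InformationRetrievalEvaluator. A sketch of how comparable numbers can be computed on your own query/title data (the queries, corpus, and relevance judgments below are illustrative toy values, not the actual bps-val-mfd-all split):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id

# Illustrative toy data: query ids -> text, corpus ids -> table titles,
# and each query id -> the set of relevant corpus ids.
queries = {"q1": "Asal impor gula Indonesia periode 2017 hingga 2023"}
corpus = {
    "d1": "Impor Gula menurut Negara Asal Utama, 2017-2023",
    "d2": "Ekspor Lada Putih menurut Negara Tujuan Utama, 2012-2023",
    "d3": "Distribusi Pembagian Pengeluaran per Kapita dan Indeks Gini, 2010-2024",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="bps-val-mfd-all",
)
results = evaluator(model)  # dict mapping metric names to values
print(results)
```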

## Training Details

### Training Dataset

#### csv

  • Dataset: csv
  • Size: 350 training samples
  • Columns: query, positive, and negative
  • Approximate statistics based on the first 350 samples:

  |         | query                                             | positive                                         | negative                                          |
  |---------|---------------------------------------------------|--------------------------------------------------|---------------------------------------------------|
  | type    | string                                            | string                                           | string                                            |
  | details | min: 7 tokens, mean: 16.16 tokens, max: 31 tokens | min: 9 tokens, mean: 23.2 tokens, max: 49 tokens | min: 5 tokens, mean: 27.02 tokens, max: 59 tokens |

  • Samples:

  | query | positive | negative |
  |-------|----------|----------|
  | Bagaimana pengeluaran rumah tangga per orang di Indonesia berubah dari 2010 sampai 2024? | Distribusi Pembagian Pengeluaran per Kapita dan Indeks Gini, 2010-2024 | Proyeksi Beban Pencemaran Udara Menurut Industri di Jawa Tengah Tahun 2020 (Ton/Tahun) |
  | Data kesenjangan pendapatan di Indonesia tahun 2010-2024: indeks Gini dan pengeluaran rata-rata. | Distribusi Pembagian Pengeluaran per Kapita dan Indeks Gini, 2010-2024 | Banyaknya Mahasiswa dan Dosen Pada Perguruan Tinggi Agama Islam Swasta di Jawa Tengah, 2018/2019 |
  | Berapa konsumsi makanan pokok per orang per minggu di Indonesia tahun 2007-2024? | Rata-Rata Konsumsi per Kapita Seminggu Beberapa Macam Bahan Makanan Penting, 2007-2024 | Rekapitulasi Industri Non Formal Yang Baru Menurut Kabupaten/kota 2012 |

  • Loss: MultipleNegativesRankingLoss with these parameters:

  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
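In sentence-transformers this loss is constructed directly from the model; a minimal sketch mirroring the parameters above (a full training setup is sketched after the hyperparameter list below):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# scale and similarity_fct mirror the parameters listed above; with (query, positive, negative)
# triplets, the positives and negatives of other examples in the batch serve as extra in-batch negatives.
loss = MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=cos_sim)
```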
    

### Training Hyperparameters

#### Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • weight_decay: 0.01
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
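
These non-default values map directly onto SentenceTransformerTrainingArguments. A minimal end-to-end training sketch under assumed file paths (train.csv / val.csv with query, positive, negative columns) and an illustrative output directory:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Assumed CSV files with query, positive, negative columns, as described in Training Dataset.
train_dataset = load_dataset("csv", data_files="train.csv", split="train")
eval_dataset = load_dataset("csv", data_files="val.csv", split="train")

loss = MultipleNegativesRankingLoss(model)  # scale=20.0 and cos_sim are the defaults

args = SentenceTransformerTrainingArguments(
    output_dir="output/miniLM-bps",  # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```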

#### All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

### Training Logs

| Epoch      | Step   | bps-val-mfd-all_cosine_ndcg@10 |
|------------|--------|--------------------------------|
| 0.9091     | 10     | 0.8300                         |
| 1.8182     | 20     | 0.8736                         |
| **2.7273** | **30** | **0.8707**                     |

  • The bold row denotes the saved checkpoint.

### Framework Versions

  • Python: 3.10.11
  • Sentence Transformers: 3.4.0
  • Transformers: 4.53.1
  • PyTorch: 2.7.1+cpu
  • Accelerate: 1.8.1
  • Datasets: 3.6.0
  • Tokenizers: 0.21.2

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```