dangvantuan committed
Commit d5669f9 · verified · 1 Parent(s): 11b15af

Update README.md

Files changed (1)
  1. README.md +52 -61
README.md CHANGED
@@ -1,90 +1,81 @@
- ---
- library_name: sentence-transformers
- pipeline_tag: sentence-similarity
- tags:
- - sentence-transformers
- - feature-extraction
- - sentence-similarity

- ---

- # {MODEL_NAME}

- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.

- <!--- Describe your model here -->

- ## Usage (Sentence-Transformers)
  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```

  Then you can use the model like this:

  ```python
  from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]
-
- model = SentenceTransformer('{MODEL_NAME}')
- embeddings = model.encode(sentences)
- print(embeddings)
- ```

- ## Evaluation Results

- <!--- Describe how your model was evaluated -->

- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})

- ## Training
- The model was trained with the parameters:

- **DataLoader**:

- `torch.utils.data.dataloader.DataLoader` of length 719 with parameters:
- ```
- {'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- ```

- **Loss**:

- `__main__.CosineSimilarityLoss`
-
- Parameters of the fit()-Method:
- ```
- {
-     "epochs": 10,
-     "evaluation_steps": 1000,
-     "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
-     "max_grad_norm": 1,
-     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
-     "optimizer_params": {
-         "eps": 1e-06,
-         "lr": 2e-05
-     },
-     "scheduler": "WarmupLinear",
-     "steps_per_epoch": null,
-     "warmup_steps": 719,
-     "weight_decay": 0.01
- }
- ```

- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
-   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```

- ## Citing & Authors

- <!--- Describe where people can find more information -->

+ # [bilingual-embedding-large](https://huggingface.co/Lajavaness/bilingual-embedding-large)

+ bilingual-embedding is a sentence-embedding model for bilingual French and English text. Built on [XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-large), a pre-trained multilingual language model, it encodes English and French sentences into a 1024-dimensional vector space, supporting a wide range of applications from semantic search to text clustering. The embeddings capture the nuanced meanings of English and French sentences, reflecting both the lexical and contextual layers of the two languages.

+ ## Full Model Architecture
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
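+
+ For intuition, the pipeline above (transformer encoder → mean pooling → L2 normalization) can be reproduced by hand. The sketch below is illustrative rather than the model's own code, and it assumes the checkpoint loads through the standard `transformers` interface with `trust_remote_code=True`:
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Lajavaness/bilingual-embedding-large")
+ model = AutoModel.from_pretrained("Lajavaness/bilingual-embedding-large", trust_remote_code=True)
+
+ batch = tokenizer(["Paris est une capitale de la France"], padding=True,
+                   truncation=True, max_length=512, return_tensors="pt")
+ with torch.no_grad():
+     token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 1024)
+
+ # Pooling: average the token embeddings, ignoring padding positions.
+ mask = batch["attention_mask"].unsqueeze(-1).float()
+ sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
+
+ # Normalize: unit-length vectors, so the dot product equals cosine similarity.
+ sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
+ print(sentence_embeddings.shape)  # torch.Size([1, 1024])
+ ```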
+
+ ## Training and Fine-tuning process
+ The model underwent a rigorous three-stage training and fine-tuning process, each stage tailored to enhance its ability to generate precise and contextually relevant sentence embeddings for French and English. Below is an outline of these stages:
+
+ ### Stage 1: NLI Training
+ - Dataset: SNLI + XNLI (English and French)
+ - Method: Training with Multiple Negatives Ranking Loss. This stage focused on improving the model's ability to discern and rank nuanced differences in sentence semantics; a minimal sketch follows.
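+
+ Roughly, the objective of this stage can be reproduced with the classic `sentence-transformers` fit API. The base checkpoint and the toy pairs below are placeholders, not the actual training data:
+
+ ```python
+ from torch.utils.data import DataLoader
+ from sentence_transformers import SentenceTransformer, InputExample, losses
+
+ model = SentenceTransformer("xlm-roberta-large")  # placeholder base checkpoint
+
+ # Each InputExample pairs a premise with an entailed hypothesis; the other
+ # hypotheses in the batch serve as in-batch negatives for the ranking loss.
+ train_examples = [
+     InputExample(texts=["Un homme joue de la guitare.", "A man is playing an instrument."]),
+     InputExample(texts=["Paris est en France.", "Paris is located in France."]),
+ ]
+ train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
+ train_loss = losses.MultipleNegativesRankingLoss(model)
+
+ model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
+ ```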
+
+ ### Stage 2: Continued Fine-tuning for Semantic Textual Similarity on STS Benchmark
+ - Dataset: STSB (French and English)
+ - Method: Fine-tuning specifically for the semantic textual similarity benchmark, using Siamese BERT-Networks configured with the `sentence-transformers` library; a minimal sketch follows.
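+
+ In `sentence-transformers`, this stage amounts to regression on scored sentence pairs with `CosineSimilarityLoss`. The checkpoint path and the two pairs below are illustrative placeholders:
+
+ ```python
+ from torch.utils.data import DataLoader
+ from sentence_transformers import SentenceTransformer, InputExample, losses
+
+ model = SentenceTransformer("path/to/nli-checkpoint")  # placeholder: output of Stage 1
+
+ # STS pairs carry gold similarity scores normalized to [0, 1].
+ train_examples = [
+     InputExample(texts=["Un avion est en train de décoller.", "A plane is taking off."], label=0.95),
+     InputExample(texts=["Un homme joue de la flûte.", "A man is eating pasta."], label=0.05),
+ ]
+ train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
+
+ # Siamese setup: the same encoder embeds both sentences; the loss regresses
+ # their cosine similarity toward the gold score.
+ train_loss = losses.CosineSimilarityLoss(model)
+ model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
+ ```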
+
+ ### Stage 3: Advanced Augmentation Fine-tuning
+ - Dataset: STSB (French and English) with [silver samples generated from the gold samples](https://www.sbert.net/examples/training/data_augmentation/README.html)
+ - Method: Employed an advanced strategy using [Augmented SBERT](https://arxiv.org/abs/2010.08240) with pair sampling strategies, integrating both Cross-Encoder and Bi-Encoder models. This stage further refined the embeddings by dynamically enriching the training data, enhancing the model's robustness and accuracy; a minimal sketch of the silver-labeling step follows.
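+
+ The core of Augmented SBERT is labeling sampled sentence pairs with a Cross-Encoder to produce silver data for the Bi-Encoder. The cross-encoder below is a public stand-in for the one fine-tuned on the gold pairs:
+
+ ```python
+ from itertools import combinations
+ from sentence_transformers import CrossEncoder, InputExample
+
+ cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")  # placeholder cross-encoder
+
+ # Sample candidate pairs from the corpus (here: all combinations of a toy corpus).
+ corpus = ["Paris est une capitale.", "Paris is a capital.", "Il pleut aujourd'hui."]
+ candidate_pairs = list(combinations(corpus, 2))
+
+ # The cross-encoder scores each pair, producing "silver" similarity labels.
+ silver_scores = cross_encoder.predict(candidate_pairs)
+ silver_examples = [
+     InputExample(texts=[s1, s2], label=float(score))
+     for (s1, s2), score in zip(candidate_pairs, silver_scores)
+ ]
+ # silver_examples are then mixed with the gold STS data to fine-tune the bi-encoder.
+ ```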
+
+ ## Usage

  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```

  Then you can use the model like this:

  ```python
  from sentence_transformers import SentenceTransformer
+
+ sentences = ["Paris est une capitale de la France", "Paris is a capital of France"]
+
+ model = SentenceTransformer('Lajavaness/bilingual-embedding-large', trust_remote_code=True)
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
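+
+ Since the embeddings are L2-normalized by the `Normalize()` module, cosine similarity is the natural way to compare them. A short, illustrative follow-up for cross-lingual semantic similarity:
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('Lajavaness/bilingual-embedding-large', trust_remote_code=True)
+ embeddings = model.encode(["Paris est une capitale de la France", "Paris is a capital of France"])
+ # The two sentences are translations of each other, so a high score is expected.
+ print(util.cos_sim(embeddings[0], embeddings[1]))
+ ```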
+ ## Evaluation
+ TODO
+ ## Citation
+
+ ```bibtex
+ @article{conneau2019unsupervised,
+   title={Unsupervised cross-lingual representation learning at scale},
+   author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
+   journal={arXiv preprint arXiv:1911.02116},
+   year={2019}
+ }
+
+ @article{reimers2019sentence,
+   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
+   author={Reimers, Nils and Gurevych, Iryna},
+   journal={arXiv preprint arXiv:1908.10084},
+   year={2019}
+ }
+
+ @article{thakur2020augmented,
+   title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
+   author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
+   journal={arXiv e-prints},
+   pages={arXiv--2010},
+   year={2020}
+ }
+ ```