---
pipeline_tag: sentence-similarity

tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
  - dpr
  - bn
  - multilingual
widget:
- source_sentence: "আমি বাংলায় গান গাই"
  sentences:
    - "I sing in Bangla"
    - "I sing in Bengali"
    - "I sing in English"
    - "আমি গান গাই না "
  example_title: "Singing"
---

# `s-xlmr-bn`

This is a [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like **clustering** or **semantic search**.


## Model Details

- Model name: s-xlmr-bn
- Model version: 1.0
- Architecture: Sentence Transformer
- Language: Multilingual (fine-tuned for Bengali)
- Base Models: 
    - [paraphrase-distilroberta-base-v2](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v2) [Teacher Model]
    - [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) [Student Model]

## Training

The model was fine-tuned using the **Multilingual Knowledge Distillation** method, with `paraphrase-distilroberta-base-v2` as the teacher model and `xlm-roberta-large` as the student model.



![image](https://i.ibb.co/8Xrgnfr/sentence-transformer-model.png)
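
As a rough illustration of the idea behind multilingual knowledge distillation, the sketch below computes a toy version of the training objective: the student's embeddings of a source sentence and of its translation are both pulled toward the teacher's embedding of the source sentence via mean squared error. The three-dimensional vectors and the helper names `mse` and `distillation_loss` are hypothetical stand-ins chosen for readability; this is not the training code used for this model.

```python
# Toy sketch of a multilingual knowledge distillation objective.
# All vectors are made-up stand-ins for model outputs, not real embeddings.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Sum of two MSE terms, both measured against the teacher embedding.

    teacher_src: teacher embedding of the source (e.g. English) sentence
    student_src: student embedding of the same source sentence
    student_tgt: student embedding of the translated (e.g. Bengali) sentence
    """
    return mse(teacher_src, student_src) + mse(teacher_src, student_tgt)


# Hypothetical embeddings for one sentence pair.
teacher_en = [0.2, -0.1, 0.7]
student_en = [0.1, -0.1, 0.6]
student_bn = [0.3, 0.0, 0.5]

loss = distillation_loss(teacher_en, student_en, student_bn)
print(f"distillation loss: {loss:.4f}")
```

Minimizing a loss of this shape encourages a sentence and its translation to land close together in the teacher's vector space, which is what makes cross-lingual comparison possible.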

## Intended Use

- **Primary Use Case:** Semantic similarity, clustering, and semantic search
- **Potential Use Cases:** Document retrieval, information retrieval, recommendation systems, chatbot systems, FAQ systems

## Usage

### Using Sentence-Transformers

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["I sing in bengali", "আমি বাংলায় গান গাই"]

model = SentenceTransformer('afschowdhury/s-xlmr-bn')
embeddings = model.encode(sentences)
print(embeddings)
```
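
Once you have embeddings, a common next step is to compare them with cosine similarity. The sketch below shows the computation in plain Python, using short hypothetical vectors in place of the 768-dimensional output of `model.encode` so that it runs without downloading the model. The `cosine_similarity` helper and the example values are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for model.encode(...) output; real vectors have 768 values.
query = [0.9, 0.1, 0.3]  # e.g. the embedding of "আমি বাংলায় গান গাই"
candidates = {
    "I sing in Bengali": [0.8, 0.2, 0.3],
    "I sing in English": [0.1, 0.9, 0.2],
}

# Rank candidate sentences from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda s: cosine_similarity(query, candidates[s]),
    reverse=True,
)
for sentence in ranked:
    print(sentence, round(cosine_similarity(query, candidates[sentence]), 3))
```

With real embeddings, the same ranking logic applies unchanged; only the vectors come from the model instead of being written by hand.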

### Using HuggingFace Transformers

Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows: first pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["I sing in bengali", "আমি বাংলায় গান গাই"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('afschowdhury/s-xlmr-bn')
model = AutoModel.from_pretrained('afschowdhury/s-xlmr-bn')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```


## Point of Contact
**Asif Faisal Chowdhury**  
E-mail: [[email protected]](mailto:[email protected]) | LinkedIn: [afschowdhury](https://www.linkedin.com/in/afschowdhury)