Mohamadlh committed
Commit a407eb3 · 1 Parent(s): 6b9672e

Upload 10 files
.gitattributes CHANGED
@@ -25,7 +25,6 @@
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,144 @@
---
license: apache-2.0
tags:
- sentiment-analysis
- text-classification
- zero-shot-distillation
- distillation
- zero-shot-classification
- deberta-v3
model-index:
- name: distilbert-base-multilingual-cased-sentiments-student
  results: []
datasets:
- tyqiangz/multilingual-sentiments
language:
- en
- ar
- de
- es
- fr
- ja
- zh
- id
- hi
- it
- ms
- pt
---

# distilbert-base-multilingual-cased-sentiments-student

This model is distilled from the zero-shot classification pipeline on the multilingual-sentiments
dataset using this [script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/zero-shot-distillation).

The multilingual-sentiments dataset is, of course, annotated, but for the sake of this example we
pretend it is unlabeled and ignore the annotations.

- Teacher model: MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
- Teacher hypothesis template: "The sentiment of this text is {}."
- Student model: distilbert-base-multilingual-cased

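To make the setup concrete, here is a minimal sketch of how such a teacher produces pseudo-labels zero-shot (illustrative only, not the training script itself; it assumes the standard `zero-shot-classification` pipeline and the three class names used throughout this card):

```python
from transformers import pipeline

# The NLI-based teacher scores each candidate label by inserting it into the
# hypothesis template and testing entailment against the input text.
teacher = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",
)

result = teacher(
    "I love this movie and i would watch it again and again!",
    candidate_labels=["positive", "neutral", "negative"],
    hypothesis_template="The sentiment of this text is {}.",
)
print(result["labels"][0])  # top-scoring label; the student is trained on these teacher distributions
```
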
## Inference example

```python
from transformers import pipeline

distilled_student_sentiment_classifier = pipeline(
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student",
    return_all_scores=True  # newer transformers releases use top_k=None instead
)

# English
distilled_student_sentiment_classifier("I love this movie and i would watch it again and again!")
>> [[{'label': 'positive', 'score': 0.9731044769287109},
    {'label': 'neutral', 'score': 0.016910076141357422},
    {'label': 'negative', 'score': 0.009985478594899178}]]

# Malay
distilled_student_sentiment_classifier("Saya suka filem ini dan saya akan menontonnya lagi dan lagi!")
>> [[{'label': 'positive', 'score': 0.9760093688964844},
    {'label': 'neutral', 'score': 0.01804516464471817},
    {'label': 'negative', 'score': 0.005945465061813593}]]

# Japanese
distilled_student_sentiment_classifier("私はこの映画が大好きで、何度も見ます!")
>> [[{'label': 'positive', 'score': 0.9342429041862488},
    {'label': 'neutral', 'score': 0.040193185210227966},
    {'label': 'negative', 'score': 0.025563929229974747}]]
```
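
If you only need the top label per text, one option is to take the argmax over the returned scores (a small sketch assuming the list-of-lists output shape shown above):

```python
# One inner list of label/score dicts is returned per input text.
scores = distilled_student_sentiment_classifier(
    "I love this movie and i would watch it again and again!"
)[0]
top = max(scores, key=lambda s: s["score"])
print(top["label"], round(top["score"], 4))  # positive 0.9731
```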

## Training procedure

Notebook link: [here](https://github.com/LxYuan0420/nlp/blob/main/notebooks/Distilling_Zero_Shot_multilingual_distilbert_sentiments_student.ipynb)

### Training hyperparameters

The result can be reproduced with the following command:

```bash
python transformers/examples/research_projects/zero-shot-distillation/distill_classifier.py \
  --data_file ./multilingual-sentiments/train_unlabeled.txt \
  --class_names_file ./multilingual-sentiments/class_names.txt \
  --hypothesis_template "The sentiment of this text is {}." \
  --teacher_name_or_path MoritzLaurer/mDeBERTa-v3-base-mnli-xnli \
  --teacher_batch_size 32 \
  --student_name_or_path distilbert-base-multilingual-cased \
  --output_dir ./distilbert-base-multilingual-cased-sentiments-student \
  --per_device_train_batch_size 16 \
  --fp16
```
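
The two input files are not part of this repository. A minimal sketch of how they could be generated from the tyqiangz/multilingual-sentiments dataset (the `"all"` config name and the `"text"` column are assumptions here; the exact preprocessing lives in the notebook linked above):

```python
import os

from datasets import load_dataset

# Combined-language training split; "all" is assumed to be the merged config.
dataset = load_dataset("tyqiangz/multilingual-sentiments", "all", split="train")

os.makedirs("multilingual-sentiments", exist_ok=True)

# The distillation script expects one unlabeled example per line ...
with open("multilingual-sentiments/train_unlabeled.txt", "w") as f:
    for text in dataset["text"]:
        f.write(text.replace("\n", " ") + "\n")

# ... and one class name per line.
with open("multilingual-sentiments/class_names.txt", "w") as f:
    f.write("positive\nneutral\nnegative\n")
```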

If you are training this model on Colab, make the following changes to the distillation script to avoid out-of-memory errors:

```python
###### modify L78 to disable the fast tokenizer
default=False,

###### update the dataset map part at L313
dataset = dataset.map(tokenizer, input_columns="text", fn_kwargs={"padding": "max_length", "truncation": True, "max_length": 512})

###### add the following lines at L213
del model
print("Manually deleted the teacher model to free some memory for the student model.")

###### add the following lines at L337
trainer.push_to_hub()
tokenizer.push_to_hub("distilbert-base-multilingual-cased-sentiments-student")
```
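
The `del model` change works because the teacher is only needed to compute the pseudo-labels; once those are cached, its weights can be released before student training starts. A slightly fuller variant of that edit (a sketch; `model` is the teacher variable inside the script, and the extra calls are standard CPython/PyTorch, not part of the original script):

```python
import gc

import torch

del model                  # drop the last reference to the teacher
gc.collect()               # reclaim the Python-side objects immediately
torch.cuda.empty_cache()   # hand the freed GPU blocks back to the allocator
```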

### Training log

```bash
Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 2009.8864, 'train_samples_per_second': 73.0, 'train_steps_per_second': 4.563, 'train_loss': 0.6473459283913797, 'epoch': 1.0}
100%|███████████████████████████████████████| 9171/9171 [33:29<00:00, 4.56it/s]
[INFO|trainer.py:762] 2023-05-06 10:56:18,555 >> The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:3129] 2023-05-06 10:56:18,557 >> ***** Running Evaluation *****
[INFO|trainer.py:3131] 2023-05-06 10:56:18,557 >> Num examples = 146721
[INFO|trainer.py:3134] 2023-05-06 10:56:18,557 >> Batch size = 128
100%|███████████████████████████████████████| 1147/1147 [08:59<00:00, 2.13it/s]
05/06/2023 11:05:18 - INFO - __main__ - Agreement of student and teacher predictions: 88.29%
[INFO|trainer.py:2868] 2023-05-06 11:05:18,251 >> Saving model checkpoint to ./distilbert-base-multilingual-cased-sentiments-student
[INFO|configuration_utils.py:457] 2023-05-06 11:05:18,251 >> Configuration saved in ./distilbert-base-multilingual-cased-sentiments-student/config.json
[INFO|modeling_utils.py:1847] 2023-05-06 11:05:18,905 >> Model weights saved in ./distilbert-base-multilingual-cased-sentiments-student/pytorch_model.bin
[INFO|tokenization_utils_base.py:2171] 2023-05-06 11:05:18,905 >> tokenizer config file saved in ./distilbert-base-multilingual-cased-sentiments-student/tokenizer_config.json
[INFO|tokenization_utils_base.py:2178] 2023-05-06 11:05:18,905 >> Special tokens file saved in ./distilbert-base-multilingual-cased-sentiments-student/special_tokens_map.json
```

### Framework versions

- Transformers 4.28.1
- Pytorch 2.0.0+cu118
- Datasets 2.11.0
- Tokenizers 0.13.3
config.json ADDED
@@ -0,0 +1,35 @@
{
  "_name_or_path": "distilbert-base-multilingual-cased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "positive",
    "1": "neutral",
    "2": "negative"
  },
  "initializer_range": 0.02,
  "label2id": {
    "negative": 2,
    "neutral": 1,
    "positive": 0
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.28.1",
  "vocab_size": 119547
}
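
Note the label order: id 0 is positive, so the score lists in the README's inference example follow `id2label`. A quick way to verify the mapping (a sketch using the standard `AutoConfig` API and the lxyuan repo id referenced in the README):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("lxyuan/distilbert-base-multilingual-cased-sentiments-student")
print(config.id2label)  # {0: 'positive', 1: 'neutral', 2: 'negative'}
print(config.label2id)  # {'negative': 2, 'neutral': 1, 'positive': 0}
```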
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
size 0
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:636845a1553fa7b0739eb8ee31f9150acabf87fe4d430f3ce0d999366bf9afcd
size 335544320
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
{
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
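
These files are what `AutoTokenizer` reads back when the model is loaded. A quick round-trip check (a sketch, again using the lxyuan repo id referenced in the README):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lxyuan/distilbert-base-multilingual-cased-sentiments-student")
print(tokenizer.model_max_length)  # 512, matching tokenizer_config.json
print(tokenizer.cls_token, tokenizer.sep_token)  # [CLS] [SEP]
```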
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d0a703ae5fa8faeb9b4595394703a0f1fdd7a4f8d8436acf449c8032d939e519
size 3643
vocab.txt ADDED
The diff for this file is too large to render. See raw diff