SamLowe committed on
Commit
a9958e8
·
1 Parent(s): 85ef073

Update README.md

Files changed (1)
  1. README.md +118 -0
README.md CHANGED
@@ -1,3 +1,121 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
+ language: en
3
+ tags:
4
+ - text-classification
5
+ - onnx
6
+ - emotions
7
+ - multi-class-classification
8
+ - multi-label-classification
9
+ datasets:
10
+ - go_emotions
11
  license: mit
12
+ inference: false
13
+ widget:
14
+ - text: ONNX is so much faster, it's very handy!
15
  ---
16
+
17
+ ### Overview
18
+
19
+ This is a multi-label, multi-class linear classifier for emotions that works with [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) embeddings. It was trained on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset.
20
+
21
+ ### Labels
22
+
23
+ The 28 labels from the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset are:
24
+ ```
25
+ ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
26
+ ```
27
+
28
+ ### Metrics (exact match of labels per item)
29
+
30
+ This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification. The metrics below are evaluated across all labels per item on the go_emotions test split.
31
+
32
+ With the threshold for each label chosen to maximise that label's F1, the metrics (evaluated on the go_emotions test split) are:
33
+
34
+ - Precision: 0.378
35
+ - Recall: 0.438
36
+ - F1: 0.394
37
+
38
+ Weighted by the relative support of each label in the dataset, the metrics are:
39
+
40
+ - Precision: 0.424
41
+ - Recall: 0.590
42
+ - F1: 0.481
43
+
44
+ Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label, the metrics (evaluated on the go_emotions test split, and unweighted by support) are:
45
+
46
+ - Precision: 0.568
47
+ - Recall: 0.214
48
+ - F1: 0.260
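
The effect of the threshold on one label's precision, recall and F1 can be illustrated with a minimal sketch. The helper names, the toy data, the `>=` comparison and the 0.05-step search grid are all assumptions made for illustration; this is not the evaluation code used to produce the figures above.

```python
def binarise(probs, threshold=0.5):
    """Turn positive-class probabilities into 0/1 predictions (>= is an assumption)."""
    return [1 if p >= threshold else 0 for p in probs]


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 for a single label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(y_true, probs, grid=None):
    """Pick the grid threshold with the highest F1 (hypothetical search grid)."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda t: precision_recall_f1(y_true, binarise(probs, t))[2])


# Made-up gold labels and positive-class probabilities for one label.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.62, 0.30, 0.41, 0.15, 0.55, 0.05]

fixed = precision_recall_f1(y_true, binarise(probs, 0.5))
tuned_threshold = best_threshold(y_true, probs)
tuned = precision_recall_f1(y_true, binarise(probs, tuned_threshold))
print("fixed 0.5:", fixed)
print("tuned:", tuned_threshold, tuned)
```

On this toy data, lowering the threshold trades some precision for recall, which is the same pattern seen when comparing the fixed-threshold figures with the per-label optimised ones.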
49
+
50
+ ### Metrics (per-label)
51
+
52
+ This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification and metrics are better measured per label.
53
+
54
+ With the threshold for each label chosen to maximise that label's F1, the metrics (evaluated on the go_emotions test split) are:
55
+
56
+ | label | f1 | precision | recall | support | threshold |
57
+ | -------------- | ----- | --------- | ------ | ------- | --------- |
58
+ | admiration | 0.540 | 0.463 | 0.649 | 504 | 0.20 |
59
+ | amusement | 0.686 | 0.669 | 0.705 | 264 | 0.25 |
60
+ | anger | 0.419 | 0.373 | 0.480 | 198 | 0.15 |
61
+ | annoyance | 0.276 | 0.189 | 0.512 | 320 | 0.10 |
62
+ | approval | 0.299 | 0.260 | 0.350 | 351 | 0.15 |
63
+ | caring | 0.303 | 0.219 | 0.489 | 135 | 0.10 |
64
+ | confusion | 0.284 | 0.269 | 0.301 | 153 | 0.15 |
65
+ | curiosity | 0.365 | 0.310 | 0.444 | 284 | 0.15 |
66
+ | desire | 0.274 | 0.237 | 0.325 | 83 | 0.15 |
67
+ | disappointment | 0.188 | 0.292 | 0.139 | 151 | 0.20 |
68
+ | disapproval | 0.305 | 0.257 | 0.375 | 267 | 0.15 |
69
+ | disgust | 0.450 | 0.462 | 0.439 | 123 | 0.20 |
70
+ | embarrassment | 0.348 | 0.375 | 0.324 | 37 | 0.30 |
71
+ | excitement | 0.313 | 0.306 | 0.320 | 103 | 0.20 |
72
+ | fear | 0.550 | 0.505 | 0.603 | 78 | 0.25 |
73
+ | gratitude | 0.776 | 0.774 | 0.778 | 352 | 0.30 |
74
+ | grief | 0.353 | 0.273 | 0.500 | 6 | 0.70 |
75
+ | joy | 0.370 | 0.361 | 0.379 | 161 | 0.20 |
76
+ | love | 0.626 | 0.717 | 0.555 | 238 | 0.35 |
77
+ | nervousness | 0.308 | 0.276 | 0.348 | 23 | 0.55 |
78
+ | optimism | 0.436 | 0.432 | 0.441 | 186 | 0.20 |
79
+ | pride | 0.444 | 0.545 | 0.375 | 16 | 0.60 |
80
+ | realization | 0.171 | 0.146 | 0.207 | 145 | 0.10 |
81
+ | relief | 0.133 | 0.250 | 0.091 | 11 | 0.60 |
82
+ | remorse | 0.468 | 0.426 | 0.518 | 56 | 0.30 |
83
+ | sadness | 0.413 | 0.409 | 0.417 | 156 | 0.20 |
84
+ | surprise | 0.314 | 0.303 | 0.326 | 141 | 0.15 |
85
+ | neutral | 0.622 | 0.482 | 0.879 | 1787 | 0.25 |
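
As a consistency check, the aggregate F1 figures quoted in the previous section can be recomputed from this table. The sketch below copies the table values and takes an unweighted mean and a support-weighted mean of the F1 column; the function and constant names are hypothetical, and the averaging scheme is inferred from the numbers rather than stated by the source.

```python
# (label, f1, precision, recall, support) copied from the per-label table above.
ROWS = [
    ("admiration", 0.540, 0.463, 0.649, 504),
    ("amusement", 0.686, 0.669, 0.705, 264),
    ("anger", 0.419, 0.373, 0.480, 198),
    ("annoyance", 0.276, 0.189, 0.512, 320),
    ("approval", 0.299, 0.260, 0.350, 351),
    ("caring", 0.303, 0.219, 0.489, 135),
    ("confusion", 0.284, 0.269, 0.301, 153),
    ("curiosity", 0.365, 0.310, 0.444, 284),
    ("desire", 0.274, 0.237, 0.325, 83),
    ("disappointment", 0.188, 0.292, 0.139, 151),
    ("disapproval", 0.305, 0.257, 0.375, 267),
    ("disgust", 0.450, 0.462, 0.439, 123),
    ("embarrassment", 0.348, 0.375, 0.324, 37),
    ("excitement", 0.313, 0.306, 0.320, 103),
    ("fear", 0.550, 0.505, 0.603, 78),
    ("gratitude", 0.776, 0.774, 0.778, 352),
    ("grief", 0.353, 0.273, 0.500, 6),
    ("joy", 0.370, 0.361, 0.379, 161),
    ("love", 0.626, 0.717, 0.555, 238),
    ("nervousness", 0.308, 0.276, 0.348, 23),
    ("optimism", 0.436, 0.432, 0.441, 186),
    ("pride", 0.444, 0.545, 0.375, 16),
    ("realization", 0.171, 0.146, 0.207, 145),
    ("relief", 0.133, 0.250, 0.091, 11),
    ("remorse", 0.468, 0.426, 0.518, 56),
    ("sadness", 0.413, 0.409, 0.417, 156),
    ("surprise", 0.314, 0.303, 0.326, 141),
    ("neutral", 0.622, 0.482, 0.879, 1787),
]
F1, PRECISION, RECALL, SUPPORT = 1, 2, 3, 4  # column positions in each row


def macro_average(rows, column):
    """Unweighted mean of one metric column across labels."""
    return sum(row[column] for row in rows) / len(rows)


def weighted_average(rows, column):
    """Mean of one metric column, weighted by each label's support."""
    total = sum(row[SUPPORT] for row in rows)
    return sum(row[column] * row[SUPPORT] for row in rows) / total


macro_f1 = macro_average(ROWS, F1)
weighted_f1 = weighted_average(ROWS, F1)
print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

With the table values as given, the unweighted mean F1 comes out at about 0.394 and the support-weighted mean at about 0.481, in line with the aggregate figures quoted earlier.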
86
+
87
+ ### Use with ONNXRuntime
88
+
89
+ The input to the model is called `logits`, and there is one output per label. Each output is a 2D array with one row per input row and two columns: the first is the predicted probability of the negative case, and the second is the predicted probability of the positive case.
90
+
91
+ ```python
92
+ # Assuming you have embeddings from all-MiniLM-L12-v2 for the input sentences
93
+ # E.g. produced from sentence-transformers such as:
94
+ # huggingface.co/sentence-transformers/all-MiniLM-L12-v2
95
+ # or from an ONNX version E.g. huggingface.co/Xenova/all-MiniLM-L12-v2
96
+
97
+ print(sentences.shape) # E.g. a batch of 1 sentence
98
+ # > (1, 384)
99
+
100
+ import onnxruntime as ort
101
+
102
+ sess = ort.InferenceSession("path_to_model_dot_onnx", providers=['CPUExecutionProvider'])
103
+
104
+ outputs = [o.name for o in sess.get_outputs()] # list of labels, in the order of the outputs
105
+ preds_onnx = sess.run(outputs, {'logits': sentences})  # `sentences` is the embedding array printed above
106
+ # preds_onnx is a list with 28 entries, one per label,
107
+ # each with a numpy array of shape (1, 2) given the input was a batch of 1
108
+
109
+ print(outputs[0])
110
+ # > surprise
111
+ print(preds_onnx[0])
112
+ # > array([[0.97136074, 0.02863926]], dtype=float32)
113
+ ```
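
The raw outputs can then be turned into a list of predicted labels. The sketch below is one possible post-processing step, not part of the model: `positive_probs` and `predicted_labels` are hypothetical helpers, and the stand-in values for `outputs` and `preds_onnx` are made up apart from the `surprise` row shown above. The thresholds are taken from the per-label table.

```python
def positive_probs(output_names, preds, row=0):
    """Map each label to its positive-class probability (column 1) for one input row."""
    return {name: float(pred[row][1]) for name, pred in zip(output_names, preds)}


def predicted_labels(probs_by_label, thresholds, default=0.5):
    """Labels whose probability reaches their threshold, most probable first."""
    kept = [
        label
        for label, prob in probs_by_label.items()
        if prob >= thresholds.get(label, default)  # >= is an assumption
    ]
    return sorted(kept, key=lambda label: -probs_by_label[label])


# Stand-ins for `outputs` and `preds_onnx` from the snippet above, cut down to
# three labels. Nested lists are used here; numpy arrays index the same way.
outputs = ["surprise", "gratitude", "neutral"]
preds_onnx = [
    [[0.97136074, 0.02863926]],  # from the example output above
    [[0.20, 0.80]],              # made-up values
    [[0.70, 0.30]],              # made-up values
]

# Per-label thresholds copied from the metrics table.
thresholds = {"surprise": 0.15, "gratitude": 0.30, "neutral": 0.25}

probs = positive_probs(outputs, preds_onnx)
labels = predicted_labels(probs, thresholds)
print(labels)
```

Labels missing from `thresholds` fall back to the fixed 0.5 threshold, which, per the metrics above, favours precision over recall.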
114
+
115
+ ### Commentary on the dataset
116
+
117
+ Some labels (e.g. gratitude) perform very strongly when considered independently, whilst others (e.g. relief) perform very poorly.
118
+
119
+ This is a challenging dataset. Labels such as relief have far fewer examples in the training data (fewer than 100 out of the 40k+, and only 11 in the test split).
120
+
121
+ There are also ambiguities and labelling errors visible in the go_emotions training data, which are suspected to constrain performance. Cleaning the dataset to reduce the mistakes, ambiguity, conflicts and duplication in the labelling would be expected to produce a higher-performing model.
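
One simple starting point for such cleaning is to look for duplicated texts that carry different label sets. The sketch below is only an illustration of that idea, not a procedure from the source: the `find_label_conflicts` helper, the normalisation rule (lower-casing and collapsing whitespace) and the example rows are all assumptions.

```python
from collections import defaultdict


def find_label_conflicts(rows):
    """Return normalised texts that appear with more than one distinct label set.

    `rows` is an iterable of (text, labels) pairs, where labels is a list of names.
    """
    label_sets_by_text = defaultdict(set)
    for text, labels in rows:
        key = " ".join(text.lower().split())  # assumed normalisation rule
        label_sets_by_text[key].add(frozenset(labels))
    return {
        text: label_sets
        for text, label_sets in label_sets_by_text.items()
        if len(label_sets) > 1
    }


# Made-up rows: the first two differ only in case and spacing but disagree on the label.
rows = [
    ("Thank you!", ["gratitude"]),
    ("thank  you!", ["neutral"]),
    ("Oh no", ["sadness"]),
    ("Oh no", ["sadness"]),
]

conflicts = find_label_conflicts(rows)
for text, label_sets in conflicts.items():
    print(text, [sorted(s) for s in label_sets])
```

Exact duplicates with identical labels (like the last two rows) are not flagged as conflicts; they would need a separate de-duplication pass.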