Ngit committed
Commit 809757b · 1 Parent(s): edeaa44

Create README.md

Files changed (1): README.md (+104, -0)

---
language:
- en
---

# Toxic Comment Classification

This model is a fine-tuned version of [nreimers/MiniLMv2-L6-H384-distilled-from-BERT-Large](https://huggingface.co/nreimers/MiniLMv2-L6-H384-distilled-from-BERT-Large) on the dataset from the [first Jigsaw Kaggle competition](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge) (toxic comment classification), using [unitary/toxic-bert](https://huggingface.co/unitary/toxic-bert) as the teacher model.
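
The distillation recipe itself is not included in this card. As a rough illustration of the usual soft-target setup for a multi-label student, where the student learns to match the teacher's per-label probabilities, a minimal sketch follows; the function, its inputs, and the choice of loss are assumptions, not the authors' confirmed method:

```py
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Hypothetical multi-label distillation: the teacher's per-label sigmoid
    # probabilities serve as soft targets for a binary cross-entropy on the
    # student's raw logits. Both tensors have shape (batch_size, num_labels).
    soft_targets = torch.sigmoid(teacher_logits).detach()
    return F.binary_cross_entropy_with_logits(student_logits, soft_targets)
```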

# Load the Model

```py
import os
import json

import numpy as np
from tokenizers import Tokenizer
from onnxruntime import InferenceSession

# Download the ONNX weights, tokenizer, and config:
# !git clone https://huggingface.co/Ngit/MiniLM-L6-toxic-all-labels-onnx

model_name = "Ngit/MiniLM-L6-toxic-all-labels"
model_dir = "MiniLM-L6-toxic-all-labels-onnx"

# The padding token and id must match the tokenizer shipped with the repository.
tokenizer = Tokenizer.from_pretrained(model_name)
tokenizer.enable_padding(
    pad_token="<pad>",
    pad_id=1,
)
tokenizer.enable_truncation(max_length=256)
batch_size = 16

texts = ["This is pure trash"]
outputs = []

# Swap in "CPUExecutionProvider" if no GPU is available.
model = InferenceSession(
    os.path.join(model_dir, "model_optimized.onnx"),
    providers=["CUDAExecutionProvider"],
)

# The id2label mapping lives in the model config.
with open(os.path.join(model_dir, "config.json"), "r") as f:
    config = json.load(f)

output_names = [output.name for output in model.get_outputs()]
input_names = [input.name for input in model.get_inputs()]

# Run inference in batches.
for subtexts in np.array_split(np.array(texts), len(texts) // batch_size + 1):
    encodings = tokenizer.encode_batch(list(subtexts))
    inputs = {
        "input_ids": np.vstack(
            [encoding.ids for encoding in encodings], dtype=np.int64
        ),
        "attention_mask": np.vstack(
            [encoding.attention_mask for encoding in encodings], dtype=np.int64
        ),
        "token_type_ids": np.vstack(
            [encoding.type_ids for encoding in encodings], dtype=np.int64
        ),
    }

    for input_name in input_names:
        if input_name not in inputs:
            raise ValueError(f"Input name {input_name} not found in inputs")

    inputs = {input_name: inputs[input_name] for input_name in input_names}
    output = np.squeeze(
        np.stack(
            model.run(output_names=output_names, input_feed=inputs)
        ),
        axis=0,
    )
    outputs.append(output)

# A sigmoid turns the raw logits into independent per-label probabilities.
outputs = np.concatenate(outputs, axis=0)
scores = 1 / (1 + np.exp(-outputs))

results = []
for item in scores:
    labels = []
    item_scores = []
    for idx, s in enumerate(item):
        labels.append(config["id2label"][str(idx)])
        item_scores.append(float(s))
    results.append({"labels": labels, "scores": item_scores})

results
```
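
Each dictionary in `results` keeps the full label list alongside its sigmoid scores, so the output is multi-label: a single comment can score high on several toxicity labels at once. Picking a decision threshold (for example, flagging labels with a score above 0.5) is left to the caller.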

# Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 6e-05
- train_batch_size: 48
- eval_batch_size: 48
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
- warmup_ratio: 0.1
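
The training script itself is not shown in this card. Assuming a standard `transformers` `Trainer` setup (an assumption, since the actual tooling is not stated), the list above maps roughly to the following `TrainingArguments`; `output_dir` is a placeholder:

```py
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters; the Adam betas and
# epsilon below are also the transformers defaults.
training_args = TrainingArguments(
    output_dir="minilm-l6-toxic",  # placeholder path
    learning_rate=6e-5,
    per_device_train_batch_size=48,
    per_device_eval_batch_size=48,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```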

# Metrics (comparison with teacher model)

| Teacher (params) | Student (params) | Set (metric) | Score (teacher) | Score (student) |
|------------------|------------------|--------------|-----------------|-----------------|
| unitary/toxic-bert (110M) | MiniLM-L6-toxic-all-labels (23M) | Test (ROC_AUC) | 0.98636 | 0.98600 |
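
The averaging mode behind the ROC_AUC figures is not stated in this card. As an illustration, a multi-label ROC_AUC can be computed with scikit-learn from the sigmoid scores produced by the inference snippet above; the `y_true` and `y_scores` values below are placeholder data, not the competition test set:

```py
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder gold labels, shape (num_samples, num_labels), and matching
# sigmoid scores; a real evaluation would use the held-out test split.
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_scores = np.array([[0.9, 0.2, 0.1], [0.1, 0.8, 0.7], [0.7, 0.6, 0.2], [0.2, 0.1, 0.9]])

# "macro" averages the per-label AUCs, one common choice for multi-label tasks.
print(roc_auc_score(y_true, y_scores, average="macro"))
```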

# Training Code, Evaluation & Deployment

Check