---
license: apache-2.0
datasets:
- deepvk/synthetic-classes
language:
- ru
base_model:
- deepvk/USER2-base
pipeline_tag: zero-shot-classification
---

# GeRaCl-USER2-base

**GeRaCl** is a **Ge**neral **Ra**pid **Cl**assifier designed to perform zero-shot classification, primarily on Russian texts.

This is a 155M-parameter model built on top of the [USER2-base](https://huggingface.co/deepvk/USER2-base) sentence encoder (149M) and fine-tuned for the zero-shot classification task.

## Performance

To evaluate the model, we measure quality on multiclass classification tasks from the `MTEB-rus` benchmark.

**MTEB-rus**

| Model | Size | Hidden Dim | Context Length | Mean(task) | Kinopoisk | Headlines | GRNTI | OECD | Inappropriateness |
|---------------------------------:|:----:|:----------:|:--------------:|:----------:|:---------:|:---------:|:-----:|:----:|:-----------------:|
| `GeRaCl-USER2-base`              | 155M | 768  | 8192 | 0.65 | 0.61 | 0.80 | 0.63 | 0.48 | 0.71 |
| `USER2-base`                     | 149M | 768  | 8192 | 0.52 | 0.50 | 0.65 | 0.56 | 0.39 | 0.51 |
| `USER-bge-m3`                    | 359M | 1024 | 8192 | 0.53 | 0.60 | 0.73 | 0.43 | 0.28 | 0.62 |
| `multilingual-e5-large-instruct` | 560M | 1024 | 512  | 0.63 | 0.56 | 0.83 | 0.62 | 0.46 | 0.67 |
| `mDeBERTa-v3-base-mnli-xnli`     | 279M | 768  | 512  | 0.45 | 0.54 | 0.53 | 0.34 | 0.23 | 0.62 |
| `bge-m3-zeroshot-v2.0`           | 568M | 1024 | 8192 | 0.60 | 0.65 | 0.72 | 0.53 | 0.41 | 0.67 |
| `Qwen2.5-1.5B-Instruct`          | 1.5B | 1536 | 128K | 0.56 | 0.62 | 0.55 | 0.51 | 0.41 | 0.71 |
| `Qwen2.5-3B-Instruct`            | 3B   | 2048 | 128K | 0.63 | 0.63 | 0.74 | 0.60 | 0.43 | 0.75 |

## Usage

### Prefixes

This model is based on the `USER2-base` sentence encoder, which relies on task-specific prefixes; for classification tasks it uses the `"classification: "` prefix.
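
If you embed texts with the base encoder directly (for example, to inspect the embedding space), prepend the prefix yourself. A minimal sketch, assuming the standard `sentence-transformers` API; this is not part of the `geracl` package:

```python
# Minimal sketch: embedding Russian texts with the base encoder while
# applying the classification prefix manually.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("deepvk/USER2-base")

texts = ["Мне не понравился этот фильм"]  # "I didn't like this movie"
embeddings = encoder.encode([f"classification: {t}" for t in texts])
print(embeddings.shape)  # (1, 768)
```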

### Code

#### Single classification scenario

```python
from transformers import AutoTokenizer
from geracl import GeraclHF, ZeroShotClassificationPipeline

# Load the fine-tuned classifier and its tokenizer.
model = GeraclHF.from_pretrained('deepvk/GeRaCl-USER2-base').to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('deepvk/GeRaCl-USER2-base')

pipe = ZeroShotClassificationPipeline(model, tokenizer, device="cuda")

text = "Утилизация катализаторов: как неплохо заработать"  # "Recycling catalysts: a good way to earn money"
labels = ["экономика", "происшествия", "политика", "культура", "наука", "спорт"]  # economics, incidents, politics, culture, science, sports
# The pipeline returns the index of the predicted label for each input text.
result = pipe(text, labels, batch_size=1)[0]

print(labels[result])
```

#### Multiple classification scenarios

```python
from transformers import AutoTokenizer
from geracl import GeraclHF, ZeroShotClassificationPipeline

model = GeraclHF.from_pretrained('deepvk/GeRaCl-USER2-base').to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('deepvk/GeRaCl-USER2-base')

pipe = ZeroShotClassificationPipeline(model, tokenizer, device="cuda")

# Each text is paired with its own list of candidate labels.
texts = [
    "Утилизация катализаторов: как неплохо заработать",  # news-topic example
    "Мне не понравился этот фильм"  # sentiment example: "I didn't like this movie"
]
labels = [
    ["экономика", "происшествия", "политика", "культура", "наука", "спорт"],
    ["нейтральный", "позитивный", "негативный"]  # neutral, positive, negative
]
results = pipe(texts, labels, batch_size=2)

# results[i] is the index of the predicted label for texts[i].
for i in range(len(labels)):
    print(labels[i][results[i]])
```

## Training details

This is the base version with 155 million parameters, built on top of the [`USER2-base`](https://huggingface.co/deepvk/USER2-base) sentence encoder. The model follows the GLiNER architecture, but produces a single vector of similarity scores instead of a full similarity matrix.
Compared to `USER2-base`, it adds two MLP layers: one for the text embeddings and one for the class embeddings. The detailed architecture is shown in the picture below.

<img src="assets/architecture.png" alt="GeRaCl architecture" width="600"/>
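
For intuition, here is a minimal sketch of this scoring scheme. The layer shapes, pooling, and activation are illustrative assumptions, not the exact `geracl` implementation:

```python
import torch
import torch.nn as nn

class GeRaClStyleHead(nn.Module):
    """Illustrative head: two MLPs project the pooled text embedding and the
    class embeddings into a shared space, then one similarity score per
    candidate class is produced (a vector, not a text-by-class matrix)."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.text_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.class_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, text_emb: torch.Tensor, class_embs: torch.Tensor) -> torch.Tensor:
        # text_emb: (hidden_dim,), class_embs: (num_classes, hidden_dim)
        return self.class_mlp(class_embs) @ self.text_mlp(text_emb)  # (num_classes,)

head = GeRaClStyleHead()
scores = head(torch.randn(768), torch.randn(6, 768))
print(scores.argmax().item())  # index of the highest-scoring class
```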

The training set is built entirely from splits of the [`deepvk/CLAZER`](https://huggingface.co/datasets/deepvk/synthetic-classes) dataset. It is a concatenation of three sub-datasets:
- **Synthetic classes part**. For every training example, we randomly chose one of the five class lists (`classes_0`…`classes_4`) and paired it with the sample's text (see the sketch after the table below). The validation and test splits were added unchanged.
- **RU-MTEB part**. The entire `ru_mteb_classes` dataset was added to the mix.
- **RU-MTEB extended part**. The entire `ru_mteb_extended_classes` dataset was added to the mix.

| Dataset | # Samples |
|--------:|:---------:|
| [CLAZER/synthetic_classes_train](https://huggingface.co/datasets/deepvk/synthetic-classes/viewer/synthetic_classes_train) | 93K |
| [CLAZER/synthetic_classes](https://huggingface.co/datasets/deepvk/synthetic-classes/viewer/synthetic_classes) (val and test) | 6K |
| [CLAZER/ru_mteb_classes](https://huggingface.co/datasets/deepvk/synthetic-classes/viewer/ru_mteb_classes/) | 52K |
| [CLAZER/ru_mteb_extended_classes](https://huggingface.co/datasets/deepvk/synthetic-classes/viewer/ru_mteb_extended_classes) | 93K |
| **Total** | 244K |
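
A minimal sketch of the class-list sampling described above. The config name and the `text` column are assumptions inferred from the dataset viewer links; only `classes_0`…`classes_4` are named in the description, so adjust names to the actual schema:

```python
import random
from datasets import load_dataset

# Sketch of the "synthetic classes" mixing step. Assumption: the
# synthetic_classes_train config exposes a `text` column alongside
# `classes_0`..`classes_4`.
ds = load_dataset("deepvk/synthetic-classes", "synthetic_classes_train", split="train")

def pick_class_list(example):
    # Pair each text with one of its five candidate class lists at random.
    return {"text": example["text"], "classes": example[f"classes_{random.randrange(5)}"]}

train_mix = ds.map(pick_class_list, remove_columns=ds.column_names)
```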

## Citations

```
@misc{deepvk2025geracl,
    title={GeRaCl},
    author={Vyrodov, Mikhail and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/GeRaCl-USER2-base},
    publisher={Hugging Face},
    year={2025}
}
```