# DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain

DeB3RTa is a family of transformer-based language models specifically designed for Portuguese financial text processing. These models are built on the DeBERTa-v2 architecture and trained using a comprehensive mixed-domain pretraining strategy that combines financial, political, business management, and accounting corpora.

## Model Variants

Two variants are available:

- **DeB3RTa-base**: 12 attention heads, 12 layers, intermediate size of 3072, hidden size of 768 (~426M parameters)
- **DeB3RTa-small**: 6 attention heads, 12 layers, intermediate size of 1536, hidden size of 384 (~70M parameters)
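
These settings can be read directly from the released configurations. A minimal sketch, assuming the repository ids follow the pattern shown in the Usage section below:

```python
from transformers import AutoConfig

# Print the main architectural hyperparameters of each variant
for repo_id in ("higopires/DeB3RTa-base", "higopires/DeB3RTa-small"):
    cfg = AutoConfig.from_pretrained(repo_id)
    print(repo_id, cfg.num_attention_heads, cfg.num_hidden_layers,
          cfg.intermediate_size, cfg.hidden_size)
```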

## Key Features

- First Portuguese financial domain-specific transformer model
- Mixed-domain pretraining incorporating finance, politics, business, and accounting texts
- Enhanced performance on financial NLP tasks compared to general-domain models
- Resource-efficient architecture with strong performance-to-parameter ratio
- Advanced fine-tuning techniques including layer reinitialization, mixout regularization, and layer-wise learning rate decay
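
As an illustration of the last point, the sketch below shows one common way to set up layer-wise learning rate decay for fine-tuning. It assumes the checkpoint loads as a standard DeBERTa-v2 model; the learning rate, decay factor, and label count are illustrative, not the values used to produce the reported results.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "higopires/DeB3RTa-base", num_labels=2  # illustrative label count
)

base_lr, decay = 2e-5, 0.9  # illustrative hyperparameters
num_layers = model.config.num_hidden_layers

# Deeper encoder layers keep a rate close to base_lr; earlier layers decay more.
param_groups = []
for i, layer in enumerate(model.deberta.encoder.layer):
    param_groups.append({"params": layer.parameters(),
                         "lr": base_lr * decay ** (num_layers - 1 - i)})

# Embeddings get the smallest rate; the task head keeps the base rate.
# (Other modules, e.g. relative position embeddings, are omitted for brevity.)
param_groups.append({"params": model.deberta.embeddings.parameters(),
                     "lr": base_lr * decay ** num_layers})
param_groups.append({"params": list(model.pooler.parameters()) +
                               list(model.classifier.parameters()),
                     "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups)
```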

## Performance

The models have been evaluated on multiple financial domain tasks:

| Task | Dataset | DeB3RTa-base F1 | DeB3RTa-small F1 |
|------|---------|-----------------|------------------|
| Fake News Detection | FAKE.BR | 0.9906 | 0.9598 |
| Sentiment Analysis | CAROSIA | 0.9207 | 0.8722 |
| Regulatory Classification | BBRC | 0.7609 | 0.6712 |
| Hate Speech Detection | OFFCOMBR-3 | 0.7539 | 0.5460 |

## Training Data

The models were trained on a diverse corpus of 1.05 billion tokens, including:

- Financial market relevant facts (2003-2023)
- Financial patents (2006-2021)
- Research articles from Brazilian SciELO
- Financial news articles (1999-2023)
- Wikipedia articles in Portuguese

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer (use "higopires/DeB3RTa-small" for the smaller variant)
model = AutoModelForMaskedLM.from_pretrained("higopires/DeB3RTa-base")
tokenizer = AutoTokenizer.from_pretrained("higopires/DeB3RTa-base")

# Example usage
text = "O mercado financeiro brasileiro apresentou [MASK] no último trimestre."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds the scores for the masked position
```
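
To turn the raw logits into actual predictions for the masked token, the fill-mask pipeline can be used. A minimal sketch, assuming the repository id above:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="higopires/DeB3RTa-base")
predictions = fill_mask(
    "O mercado financeiro brasileiro apresentou [MASK] no último trimestre."
)
for pred in predictions[:3]:
    print(pred["token_str"], round(pred["score"], 4))
```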

## Citations

If you use this model in your research, please cite:

```bibtex
@article{pires2025deb3rta,
  title={DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain},
  author={Pires, Higo and Paucar, Leonardo and Carvalho, Joao Paulo},
  journal={Big Data and Cognitive Computing},
  year={2025},
  volume={1},
  number={0},
  publisher={MDPI}
}
```

## Limitations

- Performance degradation on the smaller variant, particularly for hate speech detection
- May require task-specific fine-tuning for optimal performance
- Limited evaluation on multilingual financial tasks
- Model behavior on very long documents (>128 tokens) not extensively tested
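
For documents longer than that, one option is to truncate inputs explicitly. A minimal sketch (the 128-token limit mirrors the regime the model has mainly been evaluated on):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("higopires/DeB3RTa-base")

long_document = "Texto financeiro muito longo. " * 200  # stand-in for a long document
inputs = tokenizer(long_document, truncation=True, max_length=128, return_tensors="pt")
print(inputs["input_ids"].shape)  # torch.Size([1, 128])
```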

## License

MIT License

Copyright (c) 2025 Higo Pires

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Acknowledgments

This work was supported by the Instituto Federal de Educação, Ciência e Tecnologia do Maranhão and by the Human Language Technology Lab at the Instituto de Engenharia de Sistemas e Computadores—Investigação e Desenvolvimento (INESC-ID).