---
language: pt
license: mit
library_name: transformers
tags:
- portuguese
- financial
- bert
- deberta
- nlp
- fill-mask
- masked-lm
datasets:
- FAKE.BR
- CAROSIA
- BBRC
- OFFCOMBR-3
metrics:
- f1
- precision
- recall
- pr_auc
model-index:
- name: DeB3RTa-base
  results:
  - task:
      type: text-classification
      name: Fake News Detection
    dataset:
      type: FAKE.BR
      name: FAKE.BR
    metrics:
    - type: f1
      value: 0.9906
  - task:
      type: text-classification
      name: Sentiment Analysis
    dataset:
      type: CAROSIA
      name: CAROSIA
    metrics:
    - type: f1
      value: 0.9207
  - task:
      type: text-classification
      name: Regulatory Classification
    dataset:
      type: BBRC
      name: BBRC
    metrics:
    - type: f1
      value: 0.7609
  - task:
      type: text-classification
      name: Hate Speech Detection
    dataset:
      type: OFFCOMBR-3
      name: OFFCOMBR-3
    metrics:
    - type: f1
      value: 0.7539
inference: true
---
|
|
|
# DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain |
|
|
|
DeB3RTa is a family of transformer-based language models specifically designed for Portuguese financial text processing. These models are built on the DeBERTa-v2 architecture and trained using a comprehensive mixed-domain pretraining strategy that combines financial, political, business management, and accounting corpora. |
|
|
|
## Model Variants |
|
|
|
Two variants are available: |
|
|
|
- **DeB3RTa-base**: 12 attention heads, 12 layers, intermediate size of 3072, hidden size of 768 (~426M parameters)
- **DeB3RTa-small**: 6 attention heads, 12 layers, intermediate size of 1536, hidden size of 384 (~70M parameters)
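
These dimensions can be confirmed from the published configuration without downloading the full weights. A minimal check, assuming the checkpoints are hosted under `higopires/DeB3RTa-base` and `higopires/DeB3RTa-small` as in the usage example below:

```python
from transformers import AutoConfig

# Fetch only the configuration of the base variant
config = AutoConfig.from_pretrained("higopires/DeB3RTa-base")
print(config.num_hidden_layers, config.num_attention_heads,
      config.hidden_size, config.intermediate_size)
```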
|
|
|
## Key Features |
|
|
|
- First domain-specific transformer model for Portuguese financial text
- Mixed-domain pretraining incorporating finance, politics, business management, and accounting texts
- Stronger performance on financial NLP tasks than general-domain models
- Resource-efficient architecture with a strong performance-to-parameter ratio
- Advanced fine-tuning techniques, including layer reinitialization, mixout regularization, stochastic weight averaging, and layer-wise learning rate decay (a sketch of the decay scheme follows this list)
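
The last of these techniques can be sketched concretely: layer-wise learning rate decay assigns smaller learning rates to layers closer to the input, preserving general-purpose features while letting the upper layers adapt to the task. The sketch below builds such parameter groups for a classification head on top of the encoder; the base learning rate, decay factor, and `num_labels` are illustrative assumptions, not the configuration used in the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "higopires/DeB3RTa-base", num_labels=2  # num_labels is an assumption
)
base_lr, decay = 2e-5, 0.95  # illustrative values, not the paper's settings
num_layers = model.config.num_hidden_layers

# Encoder layers closer to the input receive geometrically smaller rates
param_groups = []
for name, param in model.named_parameters():
    if ".layer." in name:  # encoder blocks: deberta.encoder.layer.<idx>.*
        layer_idx = int(name.split(".layer.")[1].split(".")[0])
        lr = base_lr * decay ** (num_layers - 1 - layer_idx)
    elif "embeddings" in name:  # embeddings get the smallest rate
        lr = base_lr * decay ** num_layers
    else:  # pooler and classification head keep the full rate
        lr = base_lr
    param_groups.append({"params": [param], "lr": lr})

optimizer = torch.optim.AdamW(param_groups, lr=base_lr)
```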
|
|
|
## Performance |
|
|
|
The models have been evaluated on multiple financial domain tasks: |
|
|
|
| Task | Dataset | DeB3RTa-base F1 | DeB3RTa-small F1 |
|------|---------|-----------------|------------------|
| Fake News Detection | FAKE.BR | 0.9906 | 0.9598 |
| Sentiment Analysis | CAROSIA | 0.9207 | 0.8722 |
| Regulatory Classification | BBRC | 0.7609 | 0.6712 |
| Hate Speech Detection | OFFCOMBR-3 | 0.7539 | 0.5460 |
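
The table reports F1; the card metadata also lists precision, recall, and PR AUC. For reference, these metric types can be computed from a classifier's test-set outputs with scikit-learn, as in the sketch below (the label and score arrays are dummy placeholders, not model outputs):

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)

# Dummy stand-ins for true labels and predicted positive-class probabilities
y_true = [1, 0, 1, 1, 0, 1]
y_score = [0.92, 0.10, 0.85, 0.40, 0.30, 0.77]
y_pred = [int(s >= 0.5) for s in y_score]

print("F1:", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
# Average precision summarizes the precision-recall curve (PR AUC)
print("PR AUC:", average_precision_score(y_true, y_score))
```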
|
|
|
## Training Data |
|
|
|
The models were trained on a diverse corpus of 1.05 billion tokens, including:

- Financial market relevant facts (2003-2023)
- Financial patents (2006-2021)
- Research articles from the Brazilian SciELO database
- Financial news articles (1999-2023)
- Wikipedia articles in Portuguese
|
|
|
## Usage |
|
|
|
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load model and tokenizer (replace "base" with "small" for the smaller variant)
model = AutoModelForMaskedLM.from_pretrained("higopires/DeB3RTa-base")
tokenizer = AutoTokenizer.from_pretrained("higopires/DeB3RTa-base")

# Run masked-token prediction on a financial sentence
text = "O mercado financeiro brasileiro apresentou [MASK] no último trimestre."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the highest-scoring candidate for the [MASK] position
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```
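
For quick experimentation, the `fill-mask` pipeline wraps the same steps and returns the top-scoring candidates for the masked position:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="higopires/DeB3RTa-base")

# Each prediction carries the candidate token and its probability
for pred in fill_mask("O mercado financeiro brasileiro apresentou [MASK] no último trimestre."):
    print(pred["token_str"], round(pred["score"], 4))
```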
|
|
|
## Citations |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex
@article{pires2025deb3rta,
AUTHOR = {Pires, Higo and Paucar, Leonardo and Carvalho, Joao Paulo},
TITLE = {DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain},
JOURNAL = {Big Data and Cognitive Computing},
VOLUME = {9},
YEAR = {2025},
NUMBER = {3},
ARTICLE-NUMBER = {51},
URL = {https://www.mdpi.com/2504-2289/9/3/51},
ISSN = {2504-2289},
ABSTRACT = {The complex and specialized terminology of financial language in Portuguese-speaking markets creates significant challenges for natural language processing (NLP) applications, which must capture nuanced linguistic and contextual information to support accurate analysis and decision-making. This paper presents DeB3RTa, a transformer-based model specifically developed through a mixed-domain pretraining strategy that combines extensive corpora from finance, politics, business management, and accounting to enable a nuanced understanding of financial language. DeB3RTa was evaluated against prominent models—including BERTimbau, XLM-RoBERTa, SEC-BERT, BusinessBERT, and GPT-based variants—and consistently achieved significant gains across key financial NLP benchmarks. To maximize adaptability and accuracy, DeB3RTa integrates advanced fine-tuning techniques such as layer reinitialization, mixout regularization, stochastic weight averaging, and layer-wise learning rate decay, which together enhance its performance across varied and high-stakes NLP tasks. These findings underscore the efficacy of mixed-domain pretraining in building high-performance language models for specialized applications. With its robust performance in complex analytical and classification tasks, DeB3RTa offers a powerful tool for advancing NLP in the financial sector and supporting nuanced language processing needs in Portuguese-speaking contexts.},
DOI = {10.3390/bdcc9030051}
}
```
|
|
|
## Limitations |
|
|
|
- The small variant trails the base model, most sharply on hate speech detection (F1 0.5460 vs. 0.7539)
- Task-specific fine-tuning may be required for optimal performance
- Evaluation on multilingual financial tasks is limited
- Behavior on inputs longer than 128 tokens has not been extensively tested
|
|
|
## License |
|
|
|
MIT License |
|
|
|
Copyright (c) 2025 Higo Pires |
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
|
|
## Acknowledgments |
|
|
|
This work was supported by the Instituto Federal de Educação, Ciência e Tecnologia do Maranhão and by the Human Language Technology Lab at Instituto de Engenharia de Sistemas e Computadores—Investigação e Desenvolvimento (INESC-ID).