---
library_name: transformers
tags:
- low-resource
- translation
language:
- ban
- min
- en
- id
base_model:
- Yellow-AI-NLP/komodo-7b-base
---
# Model Card for NusaMT-7B
NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on Komodo-7B-base (a LLaMA2-7B derivative), it combines continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing to clean parallel sentences, and synthetic data generation.
## Model Details
### Model Description
- **Developed by:** William Tan
- **Model type:** Decoder-only Large Language Model
- **Language(s) (NLP):** Balinese, Minangkabau, Indonesian, English
<!-- - **License:** [More Information Needed] -->
- **Finetuned from model:** Yellow-AI-NLP/komodo-7b-base
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/williammtan/nusamt
- **Paper:** https://arxiv.org/abs/2410.07830
- **Demo:** https://indonesiaku.com/translate
## Uses
The model is designed for:
- Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
- Language preservation and documentation
- Cross-cultural communication
- Educational purposes and language learning
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- Integrated into translation applications
- Used for data augmentation in low-resource language tasks
- Adapted for other Indonesian regional languages
- Used as a foundation for developing language learning tools
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is not suitable for:
- Translation of languages outside its trained scope
- General text generation or chat functionality
- Real-time translation requiring minimal latency
- Critical applications where translation errors could cause harm
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
- Performance varies between translation directions, with better results for translations into low-resource languages
- Underperforms larger models such as NLLB-3.3B when translating into high-resource languages
- May not capture all dialectal variations or cultural nuances
- Uses significantly more parameters (7 billion) than traditional NMT models
- Limited by the quality and quantity of available training data
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations.
## How to Get Started with the Model
Use the code below to get started with the model.
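A minimal inference sketch with the `transformers` library. The prompt template below is an assumption (check the NusaMT repository for the exact template used during fine-tuning), and the model ID `williamhtan/NusaMT-7B` should be adjusted if it differs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_prompt(src_text: str, src_lang: str, tgt_lang: str) -> str:
    """Format a translation instruction (hypothetical template)."""
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {src_text}\n"
        f"{tgt_lang}:"
    )

def translate(model, tokenizer, text, src_lang, tgt_lang, max_new_tokens=128):
    prompt = build_prompt(text, src_lang, tgt_lang)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the generated continuation, not the prompt tokens
    generated = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("williamhtan/NusaMT-7B")
    model = AutoModelForCausalLM.from_pretrained(
        "williamhtan/NusaMT-7B", torch_dtype=torch.bfloat16, device_map="auto"
    )
    print(translate(model, tokenizer, "Good morning, how are you?", "English", "Balinese"))
```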
## Training Details
### Training Data
NusaMT: https://huggingface.co/datasets/williamhtan/NusaMT
Total parallel sentences after cleaning:
- Balinese ↔ English: 35.6k sentences
- Balinese ↔ Indonesian: 44.9k sentences
- Minangkabau ↔ English: 16.6k sentences
- Minangkabau ↔ Indonesian: 22.4k sentences
Data sources:
- NLLB Mined corpus (ODC-BY license)
- NLLB SEED dataset (CC-BY-SA license)
- BASAbaliWiki (CC-BY-SA license)
- Bible verses from Alkitab.mobi (for non-profit scholarly use)
- NusaX dataset (CC-BY-SA license)
#### Preprocessing
- Length filtering (15-500 characters)
- Word-count ratio filtering (maximum source/target ratio of 2)
- Removal of sentences with words >20 characters
- Deduplication
- Language identification with GlotLid V3 (threshold: 0.9)
- LASER3 similarity scoring (threshold: 1.09)
- GPT-4o mini-based data cleaning
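The rule-based filters above can be sketched as a single predicate over a sentence pair. This is a simplified illustration; the GlotLid, LASER3, and GPT-4o mini stages require their respective models and are omitted:

```python
def passes_filters(src: str, tgt: str) -> bool:
    """Apply the rule-based cleaning filters to one parallel sentence pair."""
    for sentence in (src, tgt):
        # Length filtering: keep sentences of 15-500 characters
        if not 15 <= len(sentence) <= 500:
            return False
        # Remove sentences containing any word longer than 20 characters
        if any(len(word) > 20 for word in sentence.split()):
            return False
    # Ratio filter: reject pairs whose word counts differ by more than 2x
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) > 2 * max(min(n_src, n_tgt), 1):
        return False
    return True

def deduplicate(pairs):
    """Drop exact duplicate sentence pairs while preserving order."""
    seen, kept = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            kept.append(pair)
    return kept
```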
#### Training Hyperparameters
- Training regime: bfloat16 mixed precision
- LoRA rank: 16
- Learning rate: 0.002
- Batch size: 10 per device
- Epochs: 3
- Data splits: 90% training, 5% validation, 5% testing
- Loss: Causal Language Modeling (CLM)
<!-- #### Speeds, Sizes, Times [optional] -->
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
<!-- [More Information Needed] -->
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
- FLORES-200 multilingual translation benchmark
- Internal test set (5% of parallel data)
#### Metrics
- spBLEU (SentencePiece tokenized BLEU)
### Results
Performance highlights:
- Outperforms SOTA models by up to +6.69 spBLEU in translations into Balinese
- Underperforms by up to -3.38 spBLEU in translations into higher-resource languages
- Consistently outperforms GPT-3.5, GPT-4, and GPT-4o in zero-shot translation
### Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements
| Models | ban β†’ en | en β†’ ban | ban β†’ id | id β†’ ban |
|-------------------------------|----------|----------|----------|----------|
| LLaMA2-7B SFT | 27.63 | 13.94 | 27.90 | 13.68 |
| + Monolingual Pre-training | 31.28 | 18.92 | 28.75 | 20.11 |
| + Mono + Backtranslation | 33.97 | 20.27 | 29.62 | 20.67 |
| + Mono + LLM Cleaner | 33.23 | 19.75 | 29.02 | 21.16 |
| + Mono + Cleaner + Backtrans. | **35.42**| **22.15**| **31.56**| **22.95**|
This table presents spBLEU scores for various configurations of the LLaMA2-7B model, showing the impact of monolingual pre-training, backtranslation, and LLM cleaning on translation performance across different language pairs.
### Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models
| Models | ban β†’ en | en β†’ ban | ban β†’ id | id β†’ ban | min β†’ en | en β†’ min | min β†’ id | id β†’ min |
|-------------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| GPT-3.5-turbo, zero-shot | 27.17 | 11.63 | 28.17 | 13.14 | 28.75 | 11.07 | 31.06 | 11.05 |
| GPT-4o, zero-shot | 27.11 | 11.45 | 27.89 | 13.08 | 28.63 | 11.00 | 31.27 | 11.00 |
| GPT-4, zero-shot | 27.20 | 11.59 | 28.41 | 13.24 | 28.51 | 10.99 | 31.00 | 10.93 |
| NLLB-600M | 33.96 | 16.86 | 30.12 | 15.15 | 35.05 | 19.72 | 31.92 | 17.72 |
| NLLB-1.3B | 37.24 | 17.73 | 32.42 | 16.21 | 38.59 | 22.79 | 34.68 | 20.89 |
| NLLB-3.3B | **38.57**| 17.09 | **33.35**| 14.85 | **40.61**| **24.71**| **35.20**| 22.44 |
| NusaMT-7B (Ours) | 35.42 | **22.15**| 31.56 | **22.95**| 37.23 | 24.32 | 34.29 | **23.27**|
This table compares the performance of NusaMT-7B with state-of-the-art models and large GPT models in terms of spBLEU scores across multiple language pairs. NusaMT-7B shows significant improvements, particularly in translations into low-resource languages.
## Environmental Impact
- **Hardware Type:** 2x NVIDIA RTX 4090
- **Hours used:** 1250
- **Cloud Provider:** Runpod.io
- **Carbon Emitted:** 210 kg CO2e
## Citation
If you find this model useful, please cite the following work:
```
@misc{tan2024nusamt7bmachinetranslationlowresource,
  title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models},
  author={William Tan and Kevin Zhu},
  year={2024},
  eprint={2410.07830},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.07830},
}
```