|
--- |
|
library_name: transformers |
|
tags: |
|
- low-resource

- translation
|
language: |
|
- ban |
|
- min |
|
- en |
|
- id |
|
base_model: |
|
- Yellow-AI-NLP/komodo-7b-base |
|
--- |
|
|
|
# Model Card for NusaMT-7B
|
|
|
NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on LLaMA2-7B and leveraging the Komodo-7B-base model, it incorporates continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing for cleaning parallel sentences, and synthetic data generation. |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** William Tan |
|
- **Model type:** Decoder-only Large Language Model |
|
- **Language(s) (NLP):** Balinese, Minangkabau, Indonesian, English |
|
<!-- - **License:** [More Information Needed] --> |
|
- **Finetuned from model:** Yellow-AI-NLP/komodo-7b-base |
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** https://github.com/williammtan/nusamt |
|
- **Paper:** https://arxiv.org/abs/2410.07830 |
|
- **Demo:** https://indonesiaku.com/translate |
|
|
|
## Uses |
|
|
|
The model is designed for: |
|
- Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau) |
|
- Language preservation and documentation |
|
- Cross-cultural communication |
|
- Educational purposes and language learning |
|
|
|
### Direct Use |
|
|
|
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
|
|
|
- Integrated into translation applications |
|
- Used for data augmentation in low-resource language tasks |
|
- Adapted for other Indonesian regional languages |
|
- Used as a foundation for developing language learning tools |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. --> |
|
|
|
The model is not suitable for: |
|
|
|
- Translation of languages outside its trained scope |
|
- General text generation or chat functionality |
|
- Real-time translation requiring minimal latency |
|
- Critical applications where translation errors could cause harm |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
- Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
|
- Performance varies between translation directions, with better results for translations into low-resource languages |
|
- Underperforms larger models (NLLB-3.3B) in translations into high-resource languages |
|
- May not capture all dialectal variations or cultural nuances |
|
- Uses significantly more parameters (7 billion) compared to traditional NMT models |
|
- Limited by the quality and quantity of available training data |
|
|
|
### Recommendations |
|
|
|
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> |
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
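
A minimal sketch with the `transformers` library. The repository id (`williamhtan/NusaMT-7B`) and the prompt template are assumptions, not confirmed by this card; adjust both to match the released checkpoint.

```python
# Sketch: prompt construction plus (optional) model loading with transformers.
# The prompt template below is an assumption, not the exact fine-tuning format.
RUN_MODEL = False  # set True once transformers is installed and weights are available


def build_prompt(text: str, src: str, tgt: str) -> str:
    """Build a simple translation prompt for the causal LM."""
    return f"Translate this from {src} to {tgt}:\n{src}: {text}\n{tgt}:"


if RUN_MODEL:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "williamhtan/NusaMT-7B"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = build_prompt("Good morning, how are you?", "English", "Balinese")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens (the translation).
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))
```

Because this is a decoder-only model, the translation is whatever the model generates after the prompt, so trimming the prompt tokens before decoding is important.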
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
NusaMT: https://huggingface.co/datasets/williamhtan/NusaMT |
|
|
|
Total parallel sentences after cleaning: |
|
- Balinese ↔ English: 35.6k sentences

- Balinese ↔ Indonesian: 44.9k sentences

- Minangkabau ↔ English: 16.6k sentences

- Minangkabau ↔ Indonesian: 22.4k sentences
|
|
|
Data sources: |
|
- NLLB Mined corpus (ODC-BY license) |
|
- NLLB SEED dataset (CC-BY-SA license) |
|
- BASAbaliWiki (CC-BY-SA license) |
|
- Bible verses from Alkitab.mobi (for non-profit scholarly use) |
|
- NusaX dataset (CC-BY-SA license) |
|
|
|
#### Preprocessing |
|
|
|
- Length filtering (15-500 characters) |
|
- Maximum source-to-target word-count ratio of 2
|
- Removal of sentences with words >20 characters |
|
- Deduplication |
|
- Language identification with GlotLid V3 (threshold: 0.9) |
|
- LASER3 similarity scoring (threshold: 1.09) |
|
- GPT-4o mini-based data cleaning |
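
The rule-based stages above can be sketched roughly as follows (thresholds taken from the list; the GlotLID, LASER3, and GPT-4o mini stages are omitted, and the authors' exact rules may differ):

```python
def keep_pair(src: str, tgt: str) -> bool:
    """Rule-based filters: character length, overlong words, word-count ratio."""
    for s in (src, tgt):
        if not (15 <= len(s) <= 500):            # length filter (15-500 characters)
            return False
        if any(len(w) > 20 for w in s.split()):  # drop sentences with words > 20 chars
            return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) > 2 * min(n_src, n_tgt):  # word-count ratio capped at 2
        return False
    return True


def clean_corpus(pairs):
    """Filter and deduplicate parallel pairs, preserving order."""
    seen, out = set(), []
    for src, tgt in pairs:
        if keep_pair(src, tgt) and (src, tgt) not in seen:
            seen.add((src, tgt))
            out.append((src, tgt))
    return out
```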
|
|
|
#### Training Hyperparameters |
|
|
|
- Training regime: bfloat16 mixed precision |
|
- LoRA rank: 16 |
|
- Learning rate: 0.002 |
|
- Batch size: 10 per device |
|
- Epochs: 3 |
|
- Data splits: 90% training, 5% validation, 5% testing |
|
- Loss: Causal Language Modeling (CLM) |
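
The 90/5/5 split can be reproduced with a deterministic shuffle, as in this sketch (the authors' actual split procedure and seed are not specified here):

```python
import random


def split_dataset(pairs, seed=42, val_frac=0.05, test_frac=0.05):
    """Shuffle deterministically, then split into train/val/test (default 90/5/5)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test
```

Fixing the seed keeps the internal 5% test split stable across runs, which matters because it doubles as an evaluation set.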
|
|
|
|
|
<!-- #### Speeds, Sizes, Times [optional] --> |
|
|
|
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
|
|
|
<!-- [More Information Needed] --> |
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
- FLORES-200 multilingual translation benchmark |
|
- Internal test set (5% of parallel data) |
|
|
|
|
|
#### Metrics |
|
|
|
- spBLEU (SentencePiece tokenized BLEU) |
|
|
|
### Results |
|
|
|
Performance highlights: |
|
- Outperforms SOTA models by up to +6.69 spBLEU in translations into Balinese |
|
- Underperforms NLLB-3.3B by up to 3.38 spBLEU in translations into higher-resource languages
|
- Consistently outperforms GPT-3.5, GPT-4, and GPT-4o in zero-shot translation |
|
|
|
### Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements |
|
|
|
| Models | ban → en | en → ban | ban → id | id → ban |
|
|-------------------------------|----------|----------|----------|----------| |
|
| LLaMA2-7B SFT | 27.63 | 13.94 | 27.90 | 13.68 | |
|
| + Monolingual Pre-training | 31.28 | 18.92 | 28.75 | 20.11 | |
|
| + Mono + Backtranslation | 33.97 | 20.27 | 29.62 | 20.67 | |
|
| + Mono + LLM Cleaner | 33.23 | 19.75 | 29.02 | 21.16 | |
|
| + Mono + Cleaner + Backtrans. | **35.42**| **22.15**| **31.56**| **22.95**| |
|
|
|
This table presents spBLEU scores for various configurations of the LLaMA2-7B model, showing the impact of monolingual pre-training, backtranslation, and LLM cleaning on translation performance across different language pairs. |
|
|
|
### Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models |
|
|
|
| Models | ban → en | en → ban | ban → id | id → ban | min → en | en → min | min → id | id → min |
|
|-------------------------------|----------|----------|----------|----------|----------|----------|----------|----------| |
|
| GPT-3.5-turbo, zero-shot | 27.17 | 11.63 | 28.17 | 13.14 | 28.75 | 11.07 | 31.06 | 11.05 | |
|
| GPT-4o, zero-shot | 27.11 | 11.45 | 27.89 | 13.08 | 28.63 | 11.00 | 31.27 | 11.00 | |
|
| GPT-4, zero-shot | 27.20 | 11.59 | 28.41 | 13.24 | 28.51 | 10.99 | 31.00 | 10.93 | |
|
| NLLB-600M | 33.96 | 16.86 | 30.12 | 15.15 | 35.05 | 19.72 | 31.92 | 17.72 | |
|
| NLLB-1.3B | 37.24 | 17.73 | 32.42 | 16.21 | 38.59 | 22.79 | 34.68 | 20.89 | |
|
| NLLB-3.3B | **38.57**| 17.09 | **33.35**| 14.85 | **40.61**| **24.71**| **35.20**| 22.44 | |
|
| NusaMT-7B (Ours) | 35.42 | **22.15**| 31.56 | **22.95**| 37.23 | 24.32 | 34.29 | **23.27**| |
|
|
|
This table compares the performance of NusaMT-7B with state-of-the-art models and large GPT models in terms of spBLEU scores across multiple language pairs. NusaMT-7B shows significant improvements, particularly in translations into low-resource languages. |
|
|
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** 2x NVIDIA RTX 4090 |
|
- **Hours used:** 1250 |
|
- **Cloud Provider:** Runpod.io |
|
- **Carbon Emitted:** 210 kg CO2e |
|
|
|
|
|
## Citation |
|
|
|
If you find this model useful, please cite the following work:
|
|
|
``` |
|
@misc{tan2024nusamt7bmachinetranslationlowresource, |
|
title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models}, |
|
author={William Tan and Kevin Zhu}, |
|
year={2024}, |
|
eprint={2410.07830}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2410.07830}, |
|
} |
|
``` |
|
|
|
|