---
library_name: transformers
tags:
- low-resource
- translation
language:
- ban
- min
- en
- id
base_model:
- Yellow-AI-NLP/komodo-7b-base
---
# Model Card for NusaMT-7B
NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on Komodo-7B-base (a LLaMA2-7B derivative), it combines continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing to clean parallel sentences, and synthetic data generation.
## Model Details
### Model Description
- **Developed by:** William Tan
- **Model type:** Decoder-only Large Language Model
- **Language(s) (NLP):** Balinese, Minangkabau, Indonesian, English
<!-- - **License:** [More Information Needed] -->
- **Finetuned from model:** Yellow-AI-NLP/komodo-7b-base
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/williammtan/nusamt
- **Paper:** https://arxiv.org/abs/2410.07830
- **Demo:** https://indonesiaku.com/translate
## Uses
The model is designed for:
- Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
- Language preservation and documentation
- Cross-cultural communication
- Educational purposes and language learning
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- Integrated into translation applications
- Used for data augmentation in low-resource language tasks
- Adapted for other Indonesian regional languages
- Used as a foundation for developing language learning tools
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is not suitable for:
- Translation of languages outside its trained scope
- General text generation or chat functionality
- Real-time translation requiring minimal latency
- Critical applications where translation errors could cause harm
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
- Performance varies between translation directions, with better results for translations into low-resource languages
- Underperforms larger models such as NLLB-3.3B when translating into high-resource languages
- May not capture all dialectal variations or cultural nuances
- Uses significantly more parameters (7 billion) than traditional NMT models
- Limited by the quality and quantity of available training data
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations.
## How to Get Started with the Model
Use the code below to get started with the model.
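A minimal inference sketch with the `transformers` library. The prompt template below is an assumption (check the NusaMT repository for the exact template used during fine-tuning), and the model ID `williamhtan/NusaMT-7B` should be adjusted if it differs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_prompt(src_text: str, src_lang: str, tgt_lang: str) -> str:
    """Format a translation instruction (hypothetical template)."""
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {src_text}\n"
        f"{tgt_lang}:"
    )

def translate(model, tokenizer, text, src_lang, tgt_lang, max_new_tokens=128):
    prompt = build_prompt(text, src_lang, tgt_lang)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the generated continuation, not the prompt tokens
    generated = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("williamhtan/NusaMT-7B")
    model = AutoModelForCausalLM.from_pretrained(
        "williamhtan/NusaMT-7B", torch_dtype=torch.bfloat16, device_map="auto"
    )
    print(translate(model, tokenizer, "Good morning, how are you?", "English", "Balinese"))
```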
## Training Details
### Training Data
NusaMT: https://huggingface.co/datasets/williamhtan/NusaMT
Total parallel sentences after cleaning:
- Balinese ↔ English: 35.6k sentences
- Balinese ↔ Indonesian: 44.9k sentences
- Minangkabau ↔ English: 16.6k sentences
- Minangkabau ↔ Indonesian: 22.4k sentences
Data sources:
- NLLB Mined corpus (ODC-BY license)
- NLLB SEED dataset (CC-BY-SA license)
- BASAbaliWiki (CC-BY-SA license)
- Bible verses from Alkitab.mobi (for non-profit scholarly use)
- NusaX dataset (CC-BY-SA license)
#### Preprocessing
- Length filtering (15-500 characters)
- Word-count ratio filtering (maximum source/target ratio of 2)
- Removal of sentences with words >20 characters
- Deduplication
- Language identification with GlotLid V3 (threshold: 0.9)
- LASER3 similarity scoring (threshold: 1.09)
- GPT-4o mini-based data cleaning
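The rule-based filters above can be sketched as a single predicate over a sentence pair. This is a simplified illustration; the GlotLid, LASER3, and GPT-4o mini stages require their respective models and are omitted:

```python
def passes_filters(src: str, tgt: str) -> bool:
    """Apply the rule-based cleaning filters to one parallel sentence pair."""
    for sentence in (src, tgt):
        # Length filtering: keep sentences of 15-500 characters
        if not 15 <= len(sentence) <= 500:
            return False
        # Remove sentences containing any word longer than 20 characters
        if any(len(word) > 20 for word in sentence.split()):
            return False
    # Ratio filter: reject pairs whose word counts differ by more than 2x
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) > 2 * max(min(n_src, n_tgt), 1):
        return False
    return True

def deduplicate(pairs):
    """Drop exact duplicate sentence pairs while preserving order."""
    seen, kept = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            kept.append(pair)
    return kept
```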
#### Training Hyperparameters
- Training regime: bfloat16 mixed precision
- LoRA rank: 16
- Learning rate: 0.002
- Batch size: 10 per device
- Epochs: 3
- Data splits: 90% training, 5% validation, 5% testing
- Loss: Causal Language Modeling (CLM)
<!-- #### Speeds, Sizes, Times [optional] -->
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
<!-- [More Information Needed] -->
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
- FLORES-200 multilingual translation benchmark
- Internal test set (5% of parallel data)
#### Metrics
- spBLEU (SentencePiece tokenized BLEU)
### Results
Performance highlights:
- Outperforms SOTA models by up to +6.69 spBLEU in translations into Balinese
- Underperforms by up to -3.38 spBLEU in translations into higher-resource languages
- Consistently outperforms GPT-3.5, GPT-4, and GPT-4o in zero-shot translation
### Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements
| Models | ban β†’ en | en β†’ ban | ban β†’ id | id β†’ ban |
|-------------------------------|----------|----------|----------|----------|
| LLaMA2-7B SFT | 27.63 | 13.94 | 27.90 | 13.68 |
| + Monolingual Pre-training | 31.28 | 18.92 | 28.75 | 20.11 |
| + Mono + Backtranslation | 33.97 | 20.27 | 29.62 | 20.67 |
| + Mono + LLM Cleaner | 33.23 | 19.75 | 29.02 | 21.16 |
| + Mono + Cleaner + Backtrans. | **35.42**| **22.15**| **31.56**| **22.95**|
This table presents spBLEU scores for various configurations of the LLaMA2-7B model, showing the impact of monolingual pre-training, backtranslation, and LLM cleaning on translation performance across different language pairs.
### Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models
| Models | ban β†’ en | en β†’ ban | ban β†’ id | id β†’ ban | min β†’ en | en β†’ min | min β†’ id | id β†’ min |
|-------------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| GPT-3.5-turbo, zero-shot | 27.17 | 11.63 | 28.17 | 13.14 | 28.75 | 11.07 | 31.06 | 11.05 |
| GPT-4o, zero-shot | 27.11 | 11.45 | 27.89 | 13.08 | 28.63 | 11.00 | 31.27 | 11.00 |
| GPT-4, zero-shot | 27.20 | 11.59 | 28.41 | 13.24 | 28.51 | 10.99 | 31.00 | 10.93 |
| NLLB-600M | 33.96 | 16.86 | 30.12 | 15.15 | 35.05 | 19.72 | 31.92 | 17.72 |
| NLLB-1.3B | 37.24 | 17.73 | 32.42 | 16.21 | 38.59 | 22.79 | 34.68 | 20.89 |
| NLLB-3.3B | **38.57**| 17.09 | **33.35**| 14.85 | **40.61**| **24.71**| **35.20**| 22.44 |
| NusaMT-7B (Ours) | 35.42 | **22.15**| 31.56 | **22.95**| 37.23 | 24.32 | 34.29 | **23.27**|
This table compares the performance of NusaMT-7B with state-of-the-art models and large GPT models in terms of spBLEU scores across multiple language pairs. NusaMT-7B shows significant improvements, particularly in translations into low-resource languages.
## Environmental Impact
- **Hardware Type:** 2x NVIDIA RTX 4090
- **Hours used:** 1250
- **Cloud Provider:** Runpod.io
- **Carbon Emitted:** 210 kg CO2e
## Citation
If you find this model useful, please cite the following work:
```
@misc{tan2024nusamt7bmachinetranslationlowresource,
  title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models},
  author={William Tan and Kevin Zhu},
  year={2024},
  eprint={2410.07830},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.07830},
}
```