Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text
[NAACL 2025]
Model Details
Model Description
The Protein2Text model is a multimodal transformer-based model designed to generate human-interpretable text from protein sequences. It combines a protein sequence encoder (ESM2) with a large language model (LLaMA 3.1-8B Instruct), using a resampling mechanism to bridge the protein representations and the language model for text generation. The model was trained and fine-tuned on the Protein2Text-QA dataset, which contains question-answer (QA) pairs generated from biomedical literature.
- Developed by: TumorAI Lab
- Model Type: Multimodal Instruction-Tuned Transformer
- Language(s) (NLP): English (Biomedical Domain)
- License: Apache 2.0
- Finetuned from model: meta-llama/Meta-Llama-3.1-8B-Instruct
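The resampling mechanism itself is not detailed in this card. The sketch below is a minimal, illustrative Perceiver-style cross-attention resampler that compresses variable-length ESM2 embeddings into a fixed number of tokens in the language model's embedding space; all class names, dimensions, and layer choices are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Illustrative Perceiver-style resampler: compresses a variable-length
    protein embedding sequence (e.g. from ESM2) into a fixed number of latent
    tokens that can be fed to the LLM. Dimensions are placeholders, not the
    paper's values."""

    def __init__(self, protein_dim=1280, llm_dim=4096, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, llm_dim) * 0.02)
        self.proj_in = nn.Linear(protein_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, protein_embeddings):
        # protein_embeddings: (batch, seq_len, protein_dim) from the protein encoder
        keys = self.proj_in(protein_embeddings)                # (batch, seq_len, llm_dim)
        queries = self.latents.unsqueeze(0).expand(protein_embeddings.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, keys, keys)     # (batch, num_latents, llm_dim)
        return self.norm(attended + queries)                   # fixed-length "protein tokens"

# One protein of length 350 becomes 64 tokens for the language model.
resampler = Resampler()
print(resampler(torch.randn(1, 350, 1280)).shape)  # torch.Size([1, 64, 4096])
```

A fixed-length output lets arbitrarily long protein sequences occupy a constant token budget in the language model's prompt.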
Model Sources
- Repository: GitHub Repository
- Paper: [More Information Needed]
- Demo: [More Information Needed]
Uses
Direct Use
- Generating textual descriptions of protein functions from protein sequences (see the encoding example below).
- Biomedical research and explainable AI applications in genomics and proteomics.
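The exact inference API of the released checkpoint is not documented in this card, so the snippet below only covers the first, well-defined step of direct use: obtaining per-residue ESM2 embeddings with the `transformers` library. The checkpoint `facebook/esm2_t33_650M_UR50D` is an assumed stand-in for whichever ESM2 variant Protein2Text actually uses.

```python
import torch
from transformers import AutoTokenizer, EsmModel

# Encode a protein sequence with ESM2 (the encoder named in this card).
# The checkpoint choice here is an assumption, not the one used by the authors.
checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = EsmModel.from_pretrained(checkpoint)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    per_residue = encoder(**inputs).last_hidden_state  # (1, length + 2, 1280)

# These embeddings are what a resampler (see the sketch above) would compress
# into a fixed number of tokens before being passed, together with the text
# prompt, to the LLaMA 3.1-8B Instruct decoder.
print(per_residue.shape)
```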
Downstream Use
- Can be fine-tuned for specific protein annotation tasks.
- Can be adapted for biomedical question-answering related to proteins.
Out-of-Scope Use
- Not designed for general NLP tasks outside of biomedical research.
- Should not be used for clinical decision-making without expert validation.
Bias, Risks, and Limitations
- The model relies on automatically generated QA pairs, which may introduce hallucinated or inaccurate information.
- Some rare proteins may not have sufficient training data, leading to unreliable outputs.
- Always verify outputs with domain experts.
- Further fine-tuning may be required for specific biomedical applications.
Training Details
Training Data
The model was fine-tuned on the Protein2Text-QA dataset, which includes:
- Protein-related abstracts retrieved from PubMed Central (PMC).
- QA pairs generated using LLaMA 3, conditioned on specific protein mentions (an illustrative prompt sketch follows below).
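The exact generation prompt is not included in this card. The sketch below is a hypothetical example of how QA pairs could be requested from a LLaMA 3 chat model conditioned on a single protein mention; `build_qa_prompt`, the prompt wording, and the placeholder `generate_fn` are all illustrative, not the authors' pipeline.

```python
def build_qa_prompt(protein_name: str, abstract: str) -> list:
    """Hypothetical chat messages asking a LLaMA 3 model for QA pairs about
    one protein mentioned in a PubMed Central abstract."""
    system = (
        "You are a biomedical assistant. Using only the abstract provided, "
        f"write question-answer pairs about the protein {protein_name}. "
        "If the abstract contains no relevant information, answer "
        "'no information found'."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": abstract},
    ]

# Usage with any LLaMA-3-style chat backend; generate_fn stands in for
# whatever inference call is available.
messages = build_qa_prompt("TP53", "The tumor suppressor TP53 regulates ...")
generate_fn = lambda msgs: "[model output would appear here]"
print(generate_fn(messages))
```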
Training Procedure
Preprocessing
- Abstract cleaning: Removal of redundant sections (e.g., "Methods", "Conclusion").
- QA filtering: Discarding QA pairs whose answers contain phrases such as "no information found" (a minimal filtering sketch follows below).
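A minimal sketch of the two preprocessing steps above, assuming simple regex-based header removal and substring matching for the QA filter; the actual rules used by the authors are not specified in this card.

```python
import re

# Structured-abstract headers treated as redundant; the exact list is an
# assumption based on the examples named above.
SECTION_HEADERS = re.compile(
    r"^\s*(methods?|results?|conclusions?|background)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def clean_abstract(text: str) -> str:
    """Drop redundant section headers such as 'Methods' or 'Conclusion'."""
    return SECTION_HEADERS.sub("", text).strip()

def keep_qa_pair(question: str, answer: str) -> bool:
    """Discard QA pairs whose answer indicates the abstract had no information."""
    return "no information found" not in answer.lower()

print(keep_qa_pair("What pathway does BRCA1 act in?", "No information found."))  # False
```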
Training Hyperparameters
| Phase | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay | Precision | Optimizer | Gradient Accumulation Steps | Warmup Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Pretraining | 256 | 2 × 10⁻³ | 1 | 2048 | 0 | bf16 (mixed precision) | AdamW | 1 | 0.03 |
| Fine-tuning | 128 | 8 × 10⁻⁶ | 5 | 2048 | 0 | bf16 (mixed precision) | AdamW | 1 | 0.03 |
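For reference, the fine-tuning row of the table maps onto Hugging Face `TrainingArguments` as sketched below. Whether the authors used the HF `Trainer`, and how the global batch of 128 was split across the two GPUs, are assumptions here; the maximum sequence length of 2048 would be enforced at tokenization/collation time rather than in these arguments.

```python
from transformers import TrainingArguments

# Fine-tuning hyperparameters from the table above (illustrative mapping only).
finetune_args = TrainingArguments(
    output_dir="protein2text-finetune",
    per_device_train_batch_size=64,   # 128 global / 2 GPUs (assumed split)
    gradient_accumulation_steps=1,
    learning_rate=8e-6,
    num_train_epochs=5,
    weight_decay=0.0,
    warmup_ratio=0.03,
    bf16=True,                        # bf16 mixed precision
    optim="adamw_torch",              # AdamW optimizer
)
```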
Evaluation
Metrics
- BLEU-2, BLEU-4 (for text quality).
- ROUGE-1, ROUGE-2, ROUGE-L (for relevance).
- METEOR (for fluency); a minimal scoring sketch is given below.
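A minimal sketch of computing these metrics with the Hugging Face `evaluate` package on a toy prediction/reference pair; the evaluation harness actually used by the authors is not described in this card.

```python
import evaluate

predictions = ["This protein functions as a tumor suppressor."]
references = ["The protein acts as a tumor suppressor in the p53 pathway."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print("BLEU-2:", bleu.compute(predictions=predictions, references=references, max_order=2)["bleu"])
print("BLEU-4:", bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"])
print("ROUGE:", rouge.compute(predictions=predictions, references=references))   # rouge1/rouge2/rougeL
print("METEOR:", meteor.compute(predictions=predictions, references=references)["meteor"])
```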
Compute Infrastructure
- Hardware Used: 2 × NVIDIA H100 PCIe 82GB
- Training Time: 12-15 hours
Citation
BibTeX:
@inproceedings{Protein2Text2025,
  title={Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text},
  author={Ala Jararweh and Oladimeji Macaulay and David Arredondo and Yue Hu and Luis Tafoya and Kushal Virupakshappa and Avinash Sahu},
  booktitle={NAACL 2025 - Industry Track},
  year={2025}
}