Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text
[NAACL 2025]
Model Details
Model Description
The Protein2Text model is a multimodal transformer-based model designed to generate human-interpretable text from protein sequences. It combines a protein sequence encoder (ESM2) with a large language model (LLaMA 3.1-8B Instruct), using a resampling mechanism to bridge the protein representations and the language model for text generation. The model was trained and fine-tuned on the Protein2Text-QA dataset, which contains question-answer (QA) pairs generated from biomedical literature.
- Developed by: TumorAI Lab
- Model Type: Multimodal Instruction-Tuned Transformer
- Language(s) (NLP): English (Biomedical Domain)
- License: Apache 2.0
- Finetuned from model: meta-llama/Meta-Llama-3.1-8B-Instruct
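The resampling mechanism itself is not detailed in this card. The sketch below is a minimal, illustrative Perceiver-style cross-attention resampler that compresses variable-length ESM2 embeddings into a fixed number of tokens in the language model's embedding space; all class names, dimensions, and layer choices are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Illustrative Perceiver-style resampler: compresses a variable-length
    protein embedding sequence (e.g. from ESM2) into a fixed number of latent
    tokens that can be fed to the LLM. Dimensions are placeholders, not the
    paper's values."""

    def __init__(self, protein_dim=1280, llm_dim=4096, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, llm_dim) * 0.02)
        self.proj_in = nn.Linear(protein_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, protein_embeddings):
        # protein_embeddings: (batch, seq_len, protein_dim) from the protein encoder
        keys = self.proj_in(protein_embeddings)                # (batch, seq_len, llm_dim)
        queries = self.latents.unsqueeze(0).expand(protein_embeddings.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, keys, keys)     # (batch, num_latents, llm_dim)
        return self.norm(attended + queries)                   # fixed-length "protein tokens"

# One protein of length 350 becomes 64 tokens for the language model.
resampler = Resampler()
print(resampler(torch.randn(1, 350, 1280)).shape)  # torch.Size([1, 64, 4096])
```

A fixed-length output lets arbitrarily long protein sequences occupy a constant token budget in the language model's prompt.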
Model Sources
- Repository: GitHub Repository
- Paper: [More Information Needed]
- Demo: [More Information Needed]
Uses
Direct Use
- Generating textual descriptions of protein functions from protein sequences (see the encoding example below).
- Biomedical research and explainable AI applications in genomics and proteomics.
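The exact inference API of the released checkpoint is not documented in this card, so the snippet below only covers the first, well-defined step of direct use: obtaining per-residue ESM2 embeddings with the `transformers` library. The checkpoint `facebook/esm2_t33_650M_UR50D` is an assumed stand-in for whichever ESM2 variant Protein2Text actually uses.

```python
import torch
from transformers import AutoTokenizer, EsmModel

# Encode a protein sequence with ESM2 (the encoder named in this card).
# The checkpoint choice here is an assumption, not the one used by the authors.
checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = EsmModel.from_pretrained(checkpoint)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    per_residue = encoder(**inputs).last_hidden_state  # (1, length + 2, 1280)

# These embeddings are what a resampler (see the sketch above) would compress
# into a fixed number of tokens before being passed, together with the text
# prompt, to the LLaMA 3.1-8B Instruct decoder.
print(per_residue.shape)
```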
Downstream Use
- Can be fine-tuned for specific protein annotation tasks.
- Can be adapted for biomedical question-answering related to proteins.
Out-of-Scope Use
- Not designed for general NLP tasks outside of biomedical research.
- Should not be used for clinical decision-making without expert validation.
Bias, Risks, and Limitations
- The model relies on automatically generated QA pairs, which may introduce hallucinated or inaccurate information.
- Some rare proteins may not have sufficient training data, leading to unreliable outputs.
- Always verify outputs with domain experts.
- Further fine-tuning may be required for specific biomedical applications.
Training Details
Training Data
The model was fine-tuned on the Protein2Text-QA dataset, which includes:
- Protein-related abstracts retrieved from PubMed Central (PMC).
- QA pairs generated using LLaMA 3, conditioned on specific protein mentions (an illustrative prompt sketch follows below).
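The exact generation prompt is not included in this card. The sketch below is a hypothetical example of how QA pairs could be requested from a LLaMA 3 chat model conditioned on a single protein mention; `build_qa_prompt`, the prompt wording, and the placeholder `generate_fn` are all illustrative, not the authors' pipeline.

```python
def build_qa_prompt(protein_name: str, abstract: str) -> list:
    """Hypothetical chat messages asking a LLaMA 3 model for QA pairs about
    one protein mentioned in a PubMed Central abstract."""
    system = (
        "You are a biomedical assistant. Using only the abstract provided, "
        f"write question-answer pairs about the protein {protein_name}. "
        "If the abstract contains no relevant information, answer "
        "'no information found'."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": abstract},
    ]

# Usage with any LLaMA-3-style chat backend; generate_fn stands in for
# whatever inference call is available.
messages = build_qa_prompt("TP53", "The tumor suppressor TP53 regulates ...")
generate_fn = lambda msgs: "[model output would appear here]"
print(generate_fn(messages))
```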
Training Procedure
Preprocessing
- Abstract cleaning: Removal of redundant sections (e.g., "Methods", "Conclusion").
- QA filtering: Discarding QA pairs whose answers contain phrases such as "no information found" (a minimal filtering sketch follows below).
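A minimal sketch of the two preprocessing steps above, assuming simple regex-based header removal and substring matching for the QA filter; the actual rules used by the authors are not specified in this card.

```python
import re

# Structured-abstract headers treated as redundant; the exact list is an
# assumption based on the examples named above.
SECTION_HEADERS = re.compile(
    r"^\s*(methods?|results?|conclusions?|background)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def clean_abstract(text: str) -> str:
    """Drop redundant section headers such as 'Methods' or 'Conclusion'."""
    return SECTION_HEADERS.sub("", text).strip()

def keep_qa_pair(question: str, answer: str) -> bool:
    """Discard QA pairs whose answer indicates the abstract had no information."""
    return "no information found" not in answer.lower()

print(keep_qa_pair("What pathway does BRCA1 act in?", "No information found."))  # False
```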
Training Hyperparameters
| Phase | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay | Precision | Optimizer | Gradient Accumulation Steps | Warmup Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Pretraining | 256 | 2 × 10⁻³ | 1 | 2048 | 0 | bf16 (mixed precision) | AdamW | 1 | 0.03 |
| Fine-tuning | 128 | 8 × 10⁻⁶ | 5 | 2048 | 0 | bf16 (mixed precision) | AdamW | 1 | 0.03 |
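For reference, the fine-tuning row of the table maps onto Hugging Face `TrainingArguments` as sketched below. Whether the authors used the HF `Trainer`, and how the global batch of 128 was split across the two GPUs, are assumptions here; the maximum sequence length of 2048 would be enforced at tokenization/collation time rather than in these arguments.

```python
from transformers import TrainingArguments

# Fine-tuning hyperparameters from the table above (illustrative mapping only).
finetune_args = TrainingArguments(
    output_dir="protein2text-finetune",
    per_device_train_batch_size=64,   # 128 global / 2 GPUs (assumed split)
    gradient_accumulation_steps=1,
    learning_rate=8e-6,
    num_train_epochs=5,
    weight_decay=0.0,
    warmup_ratio=0.03,
    bf16=True,                        # bf16 mixed precision
    optim="adamw_torch",              # AdamW optimizer
)
```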
Evaluation
Metrics
- BLEU-2, BLEU-4 (for text quality).
- ROUGE-1, ROUGE-2, ROUGE-L (for relevance).
- METEOR (for fluency); a minimal scoring sketch is given below.
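A minimal sketch of computing these metrics with the Hugging Face `evaluate` package on a toy prediction/reference pair; the evaluation harness actually used by the authors is not described in this card.

```python
import evaluate

predictions = ["This protein functions as a tumor suppressor."]
references = ["The protein acts as a tumor suppressor in the p53 pathway."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print("BLEU-2:", bleu.compute(predictions=predictions, references=references, max_order=2)["bleu"])
print("BLEU-4:", bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"])
print("ROUGE:", rouge.compute(predictions=predictions, references=references))   # rouge1/rouge2/rougeL
print("METEOR:", meteor.compute(predictions=predictions, references=references)["meteor"])
```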
Compute Infrastructure
- Hardware Used: 2 × NVIDIA H100 PCIe 82GB
- Training Time: 12-15 hours
Citation
BibTeX:
@inproceedings{Protein2Text2025,
  title={Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text},
  author={Ala Jararweh and Oladimeji Macaulay and David Arredondo and Yue Hu and Luis Tafoya and Kushal Virupakshappa and Avinash Sahu},
  booktitle={NAACL 2025 - Industry Track},
  year={2025}
}