Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text

[NAACL 2025]

Model Details

Model Description

The Protein2Text model is a multimodal transformer-based model designed to generate human-interpretable text from protein sequences. It combines a protein sequence encoder (ESM2) with a large language model (LLaMA 3.1-8B Instruct), using a resampling mechanism to connect the two and improve text generation. The model was trained and fine-tuned on the Protein2Text-QA dataset, which contains question-answer (QA) pairs generated from biomedical literature.
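
The released implementation is described in the paper and repository; the snippet below is only a minimal, illustrative PyTorch sketch of a cross-attention resampler of the kind named in the title, compressing variable-length ESM2-650M residue embeddings (1280-dim) into a fixed number of tokens projected into the LLaMA 3.1-8B embedding space (4096-dim). All module names, latent counts, and layer choices here are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class ResamplerSketch(nn.Module):
    """Illustrative cross-attention resampler: compresses a variable-length
    protein embedding sequence into a fixed number of tokens for the LLM.
    Dimensions and layer counts are assumptions, not the released config."""

    def __init__(self, esm_dim=1280, llm_dim=4096, num_latents=32, num_heads=8):
        super().__init__()
        # Learnable latent queries that attend over the ESM2 per-residue embeddings.
        self.latents = nn.Parameter(torch.randn(num_latents, esm_dim))
        self.cross_attn = nn.MultiheadAttention(esm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(esm_dim)
        # Project the resampled tokens into the language model's embedding space.
        self.proj = nn.Linear(esm_dim, llm_dim)

    def forward(self, protein_embeddings):            # (B, L, esm_dim)
        b = protein_embeddings.size(0)
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)
        resampled, _ = self.cross_attn(queries, protein_embeddings, protein_embeddings)
        return self.proj(self.norm(resampled))        # (B, num_latents, llm_dim)


if __name__ == "__main__":
    # ESM2-650M produces 1280-dim residue embeddings; LLaMA 3.1-8B uses 4096-dim tokens.
    fake_protein = torch.randn(2, 350, 1280)          # batch of 2 sequences, 350 residues
    tokens_for_llm = ResamplerSketch()(fake_protein)
    print(tokens_for_llm.shape)                       # torch.Size([2, 32, 4096])
```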

  • Developed by: TumorAI Lab
  • Model Type: Multimodal Instruction-Tuned Transformer
  • Language(s) (NLP): English (Biomedical Domain)
  • License: Apache 2.0
  • Finetuned from model: meta-llama/Meta-Llama-3.1-8B-Instruct

Model Sources

  • Repository: GitHub Repository
  • Paper: [More Information Needed]
  • Demo: [More Information Needed]

Uses

Direct Use

  • Generating textual descriptions of protein functions from protein sequences.
  • Biomedical research and explainable AI applications in genomics and proteomics.

Downstream Use

  • Can be fine-tuned for specific protein annotation tasks.
  • Can be adapted for biomedical question-answering related to proteins.

Out-of-Scope Use

  • Not designed for general NLP tasks outside of biomedical research.
  • Should not be used for clinical decision-making without expert validation.

Bias, Risks, and Limitations

  • The model relies on automatically generated QA pairs, which may introduce hallucinated or inaccurate information.
  • Some rare proteins may not have sufficient training data, leading to unreliable outputs.
  • Always verify outputs with domain experts.
  • Further fine-tuning may be required for specific biomedical applications.

Training Details

Training Data

The model was fine-tuned on the Protein2Text-QA dataset, which includes:

  • Protein-related abstracts retrieved from PubMed Central (PMC).
  • QA pairs generated using LLaMA3, conditioned on specific protein mentions.

Training Procedure

Preprocessing

  • Abstract cleaning: Removal of redundant sections (e.g., "Methods", "Conclusion").
  • QA filtering: Removing QA pairs that contain uninformative phrases such as "no information found" (a minimal sketch of both steps follows below).
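
The sketch below illustrates these two preprocessing steps, assuming QA pairs are stored as simple Python dicts; the field names, the section-header pattern, and any rejection phrases beyond the one quoted above are illustrative assumptions.

```python
import re

# Section headers treated as redundant when cleaning abstracts (illustrative list).
REDUNDANT_SECTIONS = re.compile(r"^\s*(methods?|conclusions?)\s*:?\s*$", re.IGNORECASE)

# Phrases signalling an uninformative answer; only the first is quoted in this card.
REJECT_PHRASES = ("no information found",)

def clean_abstract(text: str) -> str:
    """Drop lines that are bare section headers such as 'Methods' or 'Conclusion'."""
    kept = [line for line in text.splitlines() if not REDUNDANT_SECTIONS.match(line)]
    return "\n".join(kept)

def keep_qa_pair(qa: dict) -> bool:
    """Reject QA pairs whose answer contains an uninformative phrase."""
    answer = qa.get("answer", "").lower()
    return not any(phrase in answer for phrase in REJECT_PHRASES)

qa_pairs = [
    {"question": "What is the function of TP53?", "answer": "It regulates the cell cycle."},
    {"question": "What does this protein bind?", "answer": "No information found in the abstract."},
]
filtered = [qa for qa in qa_pairs if keep_qa_pair(qa)]
print(len(filtered))  # 1
```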

Training Hyperparameters

| Phase | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay | Precision | Optimizer | Gradient Accumulation Steps | Warmup Ratio |
|-------------|-----|----------|---|------|---|------------------------|-------|---|------|
| Pretraining | 256 | 2 × 10⁻³ | 1 | 2048 | 0 | bf16 (mixed precision) | AdamW | 1 | 0.03 |
| Fine-tuning | 128 | 8 × 10⁻⁶ | 5 | 2048 | 0 | bf16 (mixed precision) | AdamW | 1 | 0.03 |
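
For orientation, the fine-tuning row above maps roughly onto Hugging Face `TrainingArguments` as sketched below. The per-device batch size, GPU count, and output directory are assumptions (only the global batch size of 128 is reported); the split shown assumes the 2 × H100 setup listed under Compute Infrastructure.

```python
from transformers import TrainingArguments

# Fine-tuning phase from the table above, assuming 2 GPUs so that
# 64 per device × 2 devices × 1 accumulation step = global batch size 128.
finetune_args = TrainingArguments(
    output_dir="protein2text-finetune",  # assumption: any local path
    per_device_train_batch_size=64,      # assumption; only the global size (128) is reported
    gradient_accumulation_steps=1,
    learning_rate=8e-6,
    num_train_epochs=5,
    weight_decay=0.0,
    warmup_ratio=0.03,
    bf16=True,                           # bf16 mixed precision
    optim="adamw_torch",                 # AdamW
)
# Max sequence length (2048) is enforced at tokenization time rather than here.
```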

Evaluation

Metrics

  • BLEU-2, BLEU-4 (for text quality).
  • ROUGE-1, ROUGE-2, ROUGE-L (for relevance).
  • METEOR (for fluency).
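
A minimal, generic recipe for computing these metrics with the Hugging Face `evaluate` package follows (not the paper's exact evaluation script); the example strings are placeholders.

```python
import evaluate

predictions = ["This protein regulates the cell cycle and apoptosis."]
references  = ["The protein acts as a regulator of the cell cycle and apoptosis."]

bleu   = evaluate.load("bleu")
rouge  = evaluate.load("rouge")
meteor = evaluate.load("meteor")

results = {
    "BLEU-2": bleu.compute(predictions=predictions, references=references, max_order=2)["bleu"],
    "BLEU-4": bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"],
    **{k: v for k, v in rouge.compute(predictions=predictions, references=references).items()
       if k in ("rouge1", "rouge2", "rougeL")},
    "METEOR": meteor.compute(predictions=predictions, references=references)["meteor"],
}
print(results)
```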

Compute Infrastructure

  • Hardware: 2 × NVIDIA H100 PCIe 82GB
  • Training time: 12–15 hours

Citation

BibTeX:

@inproceedings{Protein2Text2025,
  title={Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text},
  author={Ala Jararweh and Oladimeji Macaulay and David Arredondo and Yue Hu and Luis Tafoya and Kushal Virupakshappa and Avinash Sahu},
  booktitle={NAACL 2025 - Industry Track},
  year={2025}
}