---
license: mit
---

<div align="center">

<h3>InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design</h3>

<p align="center">
  <a href="https://arxiv.org/abs/2410.07919">Paper</a> •
  <a href="https://github.com/HICAI-ZJU/InstructBioMol">Project</a> •
  <a href="#quickstart">Quickstart</a> •
  <a href="#citation">Citation</a>
</p>

</div>

### Model Description

InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.

*For detailed information, please refer to our [paper](https://arxiv.org/abs/2410.07919) and [code repository](https://github.com/HICAI-ZJU/InstructBioMol).*

### Released Variants

| Model Name | Stage | Multimodal | Description |
|------------|-------|------------|-------------|
| [InstructBioMol-base](https://huggingface.co/hicai-zju/InstructBioMol-base) | Pretraining | ❌ | Continually pretrained model on molecular sequences, protein sequences, and scientific literature. |
| [InstructBioMol-instruct-stage1](https://huggingface.co/hicai-zju/InstructBioMol-instruct-stage1) (*This Model*) | Instruction tuning (stage 1) | ✅ | Stage 1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
| [InstructBioMol-instruct](https://huggingface.co/hicai-zju/InstructBioMol-instruct) | Instruction tuning (stage 1 and 2) | ✅ | Fully instruction-tuned model (stage 1 and stage 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |

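To fetch the files of one of the released variants locally, something like the following minimal sketch may work. It assumes the `huggingface_hub` package is installed; `VARIANTS`, `repo_url`, and `download_variant` are illustrative names and not part of the InstructBioMol code. It only downloads the repository files; for running inference, refer to the code repository.

```python
from typing import Optional

# Repository IDs of the released variants, taken from the Hugging Face links in this card.
VARIANTS = {
    "base": "hicai-zju/InstructBioMol-base",
    "instruct-stage1": "hicai-zju/InstructBioMol-instruct-stage1",
    "instruct": "hicai-zju/InstructBioMol-instruct",
}


def repo_url(variant: str) -> str:
    """Return the Hugging Face URL of a released variant."""
    return f"https://huggingface.co/{VARIANTS[variant]}"


def download_variant(variant: str = "instruct-stage1",
                     local_dir: Optional[str] = None) -> str:
    """Download all files of one variant and return the local path.

    Hypothetical helper: it wraps huggingface_hub.snapshot_download and
    needs network access when called.
    """
    # Imported lazily so that repo_url() works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=VARIANTS[variant], local_dir=local_dir)


# Usage (requires network access and enough disk space):
#   path = download_variant("instruct-stage1")
#   print(path)
```
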
### Training Details

**Base Architecture**: InstructBioMol-base

**Training Data**:

1. Molecule - Natural Language Alignment:
   - 60 million samples from PubChem and ChEBI

2. Protein - Natural Language Alignment:
   - 35 million samples from UniProt (Swiss-Prot and TrEMBL)

3. Molecule - Protein Alignment:
   - 1 million samples from BindingDB and Rhea

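To get a feel for how the three alignment tasks compare in size, the counts listed above can be tallied as in the sketch below. It treats the listed figures as exact and assumes every sample is used once; the card does not state how the tasks are actually weighted or sampled during training, and the names `STAGE1_DATA_MILLIONS` and `mixture_shares` are illustrative.

```python
# Approximate sample counts in millions, copied from the training data list.
STAGE1_DATA_MILLIONS = {
    "molecule-text": 60,     # PubChem, ChEBI
    "protein-text": 35,      # UniProt (Swiss-Prot, TrEMBL)
    "molecule-protein": 1,   # BindingDB, Rhea
}


def mixture_shares(counts):
    """Return each task's fraction of the total sample count."""
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}


total = sum(STAGE1_DATA_MILLIONS.values())
shares = mixture_shares(STAGE1_DATA_MILLIONS)

print(f"total: ~{total} million samples")
for task, share in shares.items():
    print(f"{task:>17}: {share:6.1%}")
```

Under these assumptions the total is about 96 million samples, with molecule-text pairs making up 62.5% of the mix and molecule-protein pairs roughly 1%.
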
**Training Objective**: Instruction tuning

### Citation

```bibtex
@article{zhuang2025advancing,
  author    = {Xiang Zhuang and
               Keyan Ding and
               Tianwen Lyu and
               Yinuo Jiang and
               Xiaotong Li and
               Zhuoyi Xiang and
               Zeyuan Wang and
               Ming Qin and
               Kehua Feng and
               Jike Wang and
               Qiang Zhang and
               Huajun Chen},
  title     = {Advancing biomolecular understanding and design following human instructions},
  journal   = {Nature Machine Intelligence},
  pages     = {1--14},
  year      = {2025},
  publisher = {Nature Publishing Group UK London}
}
```