Jingjing Zhai committed
Commit 792da0e · 1 Parent(s): 71df09f
Update README
README.md CHANGED
@@ -4,7 +4,7 @@ license: apache-2.0
 
 ## Model Overview
 
-PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Utilizing the Caduceus
+PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Utilizing the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and a masked language modeling objective, PlantCaduceus is designed to pre-train genomic sequences from 16 species spanning a history of 160 million years. We have trained a series of PlantCaduceus models with varying parameter sizes:
 
 - **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
 - **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters
@@ -14,7 +14,7 @@ PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Util
 ## How to use
 ```python
 from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer
-model_path = '
+model_path = 'kuleshov-group/PlantCaduceus_l32'
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
 model.eval()
@@ -30,4 +30,24 @@ encoding = tokenizer.encode_plus(
 input_ids = encoding["input_ids"].to(device)
 with torch.inference_mode():
     outputs = model(input_ids=input_ids, output_hidden_states=True)
-```
+```
+
+## Citation
+```bibtex
+@article {Zhai2024.06.04.596709,
+author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
+title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
+elocation-id = {2024.06.04.596709},
+year = {2024},
+doi = {10.1101/2024.06.04.596709},
+publisher = {Cold Spring Harbor Laboratory},
+abstract = {Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation, thus expected to offer better cross-species prediction through fine-tuning on limited labeled data compared to supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize that diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.Competing Interest StatementThe authors have declared no competing interest.},
+URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
+eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
+journal = {bioRxiv}
+}
+
+```
+
+## Contact
+Jingjing Zhai ([email protected])
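The `How to use` snippet in this diff calls `torch.cuda.is_available()` and `torch.inference_mode()`, but none of the lines shown import `torch`, so it would likely fail as written. The sketch below is one way to make those steps self-contained. It is not the authors' code: `pick_device` and `run_plantcaduceus` are hypothetical names, and because the diff elides the arguments to `tokenizer.encode_plus(`, a plain tokenizer call stands in for them as an assumption.

```python
from typing import Optional


def pick_device(cuda_available: bool) -> str:
    """Mirror the README's rule: first GPU if CUDA is available, else CPU."""
    return "cuda:0" if cuda_available else "cpu"


def run_plantcaduceus(
    sequence: str,
    model_path: str = "kuleshov-group/PlantCaduceus_l32",
    device: Optional[str] = None,
):
    """Hypothetical wrapper around the README steps; returns the raw model outputs."""
    # Imported here so that pick_device() stays usable without these packages.
    import torch  # used by the README snippet, but not imported in the lines shown
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    if device is None:
        device = pick_device(torch.cuda.is_available())

    # Same loading calls as in the README diff.
    model = AutoModelForMaskedLM.from_pretrained(
        model_path, trust_remote_code=True
    ).to(device)
    model.eval()

    # Assumption: the diff elides the tokenizer lines, so a plain tokenizer
    # call is used here instead of the README's encode_plus arguments.
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    encoding = tokenizer(sequence, return_tensors="pt")
    input_ids = encoding["input_ids"].to(device)

    # Inference only: no gradients are tracked inside this context.
    with torch.inference_mode():
        return model(input_ids=input_ids, output_hidden_states=True)
```

Calling `run_plantcaduceus("ATGCGTACGATCGTAG")` (an illustrative sequence, not one from the README) would download the checkpoint on first use and requires `torch` and `transformers` to be installed. Keeping the heavy imports inside the function means the device helper can be checked on a machine without either package.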