Commit 1ff2552 (verified) by JingjingZhai · Parent: 2d77d53

Update README.md
Files changed (1): README.md (+52 −50)
---
license: apache-2.0
---

## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 angiosperm genomes. Built on the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures with a masked language modeling objective, PlantCaduceus learns evolutionary conservation and DNA sequence grammar from 16 species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models with varying parameter sizes:

- **[PlantCaduceus_l20](https://huggingface.co/kuleshov-group/PlantCaduceus_l20)**: 20 layers, 384 hidden size, 20M parameters
- **[PlantCaduceus_l24](https://huggingface.co/kuleshov-group/PlantCaduceus_l24)**: 24 layers, 512 hidden size, 40M parameters
- **[PlantCaduceus_l28](https://huggingface.co/kuleshov-group/PlantCaduceus_l28)**: 28 layers, 768 hidden size, 112M parameters
- **[PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)**: 32 layers, 1024 hidden size, 225M parameters

Note: we highly recommend using the largest model ([PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)) for zero-shot score estimation; a sketch of one way to compute such a score follows the usage example below.

## How to use
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'kuleshov-group/PlantCaduceus_l32'
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# The model code lives on the Hub, so trust_remote_code=True is required.
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize a DNA sequence and run a forward pass, keeping all hidden states.
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
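
`outputs.hidden_states` follows the standard Hugging Face convention: a tuple of per-layer activations whose last entry is the final layer. A minimal sketch of extracting per-nucleotide embeddings from it (the mean-pooling step is an illustrative assumption, not a prescribed recipe):

```python
# Final-layer activations, shape (batch, seq_len, hidden_dim); the exact
# hidden_dim depends on the model configuration.
embeddings = outputs.hidden_states[-1]

# One simple sequence-level summary: mean-pool over positions (illustrative).
mean_embedding = embeddings.mean(dim=1)
print(embeddings.shape, mean_embedding.shape)
```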
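
For the zero-shot score mentioned above, one common masked-LM recipe is to mask the position of interest and compare the log-probabilities of the alternate and reference alleles. The sketch below illustrates that general recipe and is an assumption, not the exact scoring procedure from the paper; the 0-based `pos`, the allele choices, and the premise that the tokenizer maps one nucleotide to one token without inserting special tokens are all hypothetical.

```python
# Hypothetical zero-shot scoring sketch: log P(alt) - log P(ref) at a masked
# position. Assumes a character-level tokenizer with one token per base and
# no special tokens inserted; adjust the position offset otherwise.
seq = "ATGCGTACGATCGTAG"
pos = 8                    # hypothetical 0-based variant position
ref, alt = seq[pos], "A"   # hypothetical reference/alternate alleles

ids = tokenizer.encode_plus(
    seq, return_tensors="pt",
    return_attention_mask=False, return_token_type_ids=False
)["input_ids"].to(device)
ids[0, pos] = tokenizer.mask_token_id  # mask the variant position

with torch.inference_mode():
    logits = model(input_ids=ids).logits

log_probs = torch.log_softmax(logits[0, pos], dim=-1)
ref_id = tokenizer.convert_tokens_to_ids(ref)
alt_id = tokenizer.convert_tokens_to_ids(alt)
zero_shot_score = (log_probs[alt_id] - log_probs[ref_id]).item()
print(zero_shot_score)
```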

## Citation
```bibtex
@article{Zhai2024.06.04.596709,
    author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S and Kuleshov, Volodymyr},
    title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
    elocation-id = {2024.06.04.596709},
    year = {2024},
    doi = {10.1101/2024.06.04.596709},
    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
    eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
    journal = {bioRxiv}
}
```

## Contact
Jingjing Zhai ([email protected])