---
tags:
- biology
---
# Model Card for PlantGFM

PlantGFM is a genetic foundation model pre-trained on the complete genome sequences of 12 model plants, totaling 108 billion nucleotides. Built on the Hyena framework with 220 million parameters and a context length of 64K bp, PlantGFM models sequences at single-nucleotide resolution. Pre-training used a length warm-up strategy, starting with 1K bp fragments and gradually increasing to 64K bp, which improved training stability and accelerated convergence.
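As a rough picture of the length warm-up, the sketch below doubles the training context from 1K bp toward 64K bp at evenly spaced milestones; the stage boundaries and step counts are illustrative assumptions, not the schedule actually used for PlantGFM.

```python
# Illustrative context-length warm-up schedule (hypothetical milestones;
# the exact schedule used for PlantGFM is not specified in this card).
def context_length_at(step: int, total_steps: int,
                      start_len: int = 1_024, max_len: int = 65_536) -> int:
    """Double the context length at evenly spaced milestones, 1K bp -> 64K bp."""
    n_stages = (max_len // start_len).bit_length()  # 64 -> 7 stages: 1K..64K
    stage = min(step * n_stages // total_steps, n_stages - 1)
    return start_len * (2 ** stage)

for step in range(0, 70_000, 10_000):
    print(step, context_length_at(step, total_steps=70_000))
```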

### Model Sources

- **Repository:** [PlantGFM](https://github.com/hu-lab-PlantGLM/PlantGLM)
- **Manuscript:** [A Genetic Foundation Model for Discovery and Creation of Plant Genes]()

**Developed by:** hu-lab

# How to use the model

Install the runtime library first:

```bash
pip install transformers
```

To calculate the embedding of a DNA sequence:

```python
import torch
from transformers import PreTrainedTokenizerFast

from plantgfm.configuration_plantgfm import PlantGFMConfig
from plantgfm.modeling_plantgfm import PlantGFMForCausalLM

config = PlantGFMConfig.from_pretrained("hu-lab/PlantGFM")
tokenizer = PreTrainedTokenizerFast.from_pretrained("hu-lab/PlantGFM")
model = PlantGFMForCausalLM.from_pretrained("hu-lab/PlantGFM", config=config)
model.eval()

sequences = ["CCCTAAACCCTAAACCCTAAA", "ATGGCGTGGCTG"]

# The model works at single-nucleotide resolution, so insert a space
# between bases to tokenize each sequence base by base.
single_nucleotide_sequences = [" ".join(seq) for seq in sequences]

# Pad the batch to the length of its longest sequence.
tokenized_sequences = tokenizer(single_nucleotide_sequences, padding="longest")["input_ids"]
input_ids = torch.LongTensor(tokenized_sequences)

# hidden_states[0] is the first entry of the returned hidden states
# (the embedding-layer output under the transformers convention);
# shape: (batch_size, sequence_length, hidden_size).
with torch.no_grad():
    embd = model(input_ids=input_ids, output_hidden_states=True)["hidden_states"][0]
print(embd)
```
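
The hidden states above are per-nucleotide. To get one fixed-size vector per sequence, a common follow-up is to mean-pool over non-padding positions; a minimal sketch continuing from the snippet above (the pooling strategy is our illustration, not something the card prescribes):

```python
# Mean-pool the per-nucleotide embeddings into one vector per sequence,
# masking out padding positions (continues from the snippet above).
mask = (input_ids != tokenizer.pad_token_id).unsqueeze(-1).to(embd.dtype)
pooled = (embd * mask).sum(dim=1) / mask.sum(dim=1)  # (batch_size, hidden_size)
print(pooled.shape)
```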
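
Since PlantGFM is trained as a causal language model, it can in principle also generate sequence continuations. A minimal sketch, assuming the model exposes the standard transformers generate() API (this usage is our assumption and is not shown in the original card):

```python
# Hypothetical generation sketch: prompt with a short fragment and sample
# a continuation. Assumes PlantGFMForCausalLM supports GenerationMixin.
prompt = " ".join("ATGGCGTGGCTG")  # space-separated single-nucleotide tokens
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(inputs["input_ids"], max_new_tokens=50, do_sample=True, top_k=4)
print(tokenizer.decode(out[0]))
```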

#### Hardware

The model was trained for 468 hours on 8 NVIDIA A800 80GB GPUs.