hu-lab committed
Commit 8b1cf03 · verified · 1 Parent(s): 7a71c89

Create README.md

Files changed (1): README.md ADDED (+50 -0)
 
---
tags:
- biology
---

# Model Card for PlantGFM
PlantGFM is a genetic foundation model pre-trained on the complete genome sequences of 12 model plants, encompassing 108 billion nucleotides. Built on the Hyena framework with 220 million parameters and a 64K bp context length, PlantGFM models sequences at single-nucleotide resolution. Pre-training used a length warm-up strategy, beginning with 1K bp fragments and gradually increasing to 64K bp, which improved training stability and accelerated convergence.
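
To make the length warm-up concrete, here is a minimal sketch of the idea in Python. The doubling schedule and the `warmup_length` helper are hypothetical illustrations, not the actual milestones used for PlantGFM:

```python
# Hypothetical sketch of a length warm-up schedule: training fragments
# start short and grow toward the full 64K bp context. The milestone
# spacing here is illustrative, not PlantGFM's actual schedule.

def warmup_length(step: int, start_len: int = 1_024, max_len: int = 65_536,
                  double_every: int = 1_000) -> int:
    """Double the fragment length every `double_every` steps, capped at `max_len`."""
    length = start_len * (2 ** (step // double_every))
    return min(length, max_len)

for step in range(0, 7_000, 1_000):
    print(step, warmup_length(step))  # 1024, 2048, 4096, ... capped at 65536
```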

### Model Sources

- **Repository:** [PlantGFM](https://github.com/hu-lab-PlantGLM/PlantGLM)
- **Manuscript:** [A Genetic Foundation Model for Discovery and Creation of Plant Genes]()

**Developed by:** hu-lab

# How to use the model

Install the runtime library first:
```bash
pip install transformers
```
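
Alternatively, if the Hub repository registers its custom classes for auto loading (an `auto_map` entry, which is not verified here), the standard `Auto` classes may work without a local `plantgfm` package:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code lets transformers load the custom PlantGFM classes
# from the Hub repository, assuming the repo registers them via auto_map.
tokenizer = AutoTokenizer.from_pretrained("hu-lab/PlantGFM")
model = AutoModelForCausalLM.from_pretrained("hu-lab/PlantGFM", trust_remote_code=True)
```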
Note that the `plantgfm` modules imported below are not installed by `pip install transformers`; they are assumed to ship with the model code in the repository above. To compute the embedding of a DNA sequence:
```python
import torch
from transformers import PreTrainedTokenizerFast
from plantgfm.modeling_plantgfm import PlantGFMForCausalLM
from plantgfm.configuration_plantgfm import PlantGFMConfig

# load the config, tokenizer, and pre-trained weights from the Hub
config = PlantGFMConfig.from_pretrained("hu-lab/PlantGFM")
tokenizer = PreTrainedTokenizerFast.from_pretrained("hu-lab/PlantGFM")
model = PlantGFMForCausalLM.from_pretrained("hu-lab/PlantGFM", config=config)

sequences = ["CCCTAAACCCTAAACCCTAAA", "ATGGCGTGGCTG"]

# tokenize at single-nucleotide resolution: put a space between bases
single_nucleotide_sequences = [" ".join(seq) for seq in sequences]

# pad the batch to the longest sequence and convert to a tensor
tokenized_sequences = tokenizer(single_nucleotide_sequences, padding="longest")["input_ids"]
input_ids = torch.LongTensor(tokenized_sequences)

# hidden_states[0] is the embedding-layer output;
# use hidden_states[-1] for the last hidden layer instead
embd = model(input_ids=input_ids, output_hidden_states=True)["hidden_states"][0]
print(embd)
```
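
To reduce the per-token representations to one fixed-size vector per sequence, a common approach is masked mean pooling. A minimal sketch (not from the original README), continuing from the code above and using the last hidden layer:

```python
# re-tokenize, this time returning tensors and an attention mask
batch = tokenizer(single_nucleotide_sequences, padding="longest", return_tensors="pt")

with torch.no_grad():
    hidden = model(input_ids=batch["input_ids"],
                   output_hidden_states=True)["hidden_states"][-1]

# mean-pool over real tokens only, masking out padding positions
mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)   # (batch, seq_len, 1)
seq_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
print(seq_embeddings.shape)
```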

#### Hardware

The model was trained for 468 hours on 8 NVIDIA A800-80G GPUs.