---
tags:
- biology
---
# Model Card for PlantGFM

PlantGFM is a genetic foundation model pre-trained on the complete genome sequences of 12 model plants, totaling 108 billion nucleotides. Built on the Hyena framework with 220 million parameters and a context length of 64K bp, PlantGFM models sequences at single-nucleotide resolution. Pre-training used a length warm-up strategy, starting with 1K bp fragments and gradually increasing to 64K bp, which improved training stability and accelerated convergence.
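As a rough picture of the length warm-up, the sketch below doubles the training context from 1K bp toward 64K bp at evenly spaced milestones; the stage boundaries and step counts are illustrative assumptions, not the schedule actually used for PlantGFM.

```python
# Illustrative context-length warm-up schedule (hypothetical milestones;
# the exact schedule used for PlantGFM is not specified in this card).
def context_length_at(step: int, total_steps: int,
                      start_len: int = 1_024, max_len: int = 65_536) -> int:
    """Double the context length at evenly spaced milestones, 1K bp -> 64K bp."""
    n_stages = (max_len // start_len).bit_length()  # 64 -> 7 stages: 1K..64K
    stage = min(step * n_stages // total_steps, n_stages - 1)
    return start_len * (2 ** stage)

for step in range(0, 70_000, 10_000):
    print(step, context_length_at(step, total_steps=70_000))
```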

### Model Sources

- **Repository:** [PlantGFM](https://github.com/hu-lab-PlantGLM/PlantGLM)
- **Manuscript:** [A Genetic Foundation Model for Discovery and Creation of Plant Genes]()

**Developed by:** hu-lab

# How to use the model

Install the runtime library first:

```bash
pip install transformers
```

To calculate the embedding of a DNA sequence:

```python
import torch
from transformers import PreTrainedTokenizerFast

from plantgfm.configuration_plantgfm import PlantGFMConfig
from plantgfm.modeling_plantgfm import PlantGFMForCausalLM

config = PlantGFMConfig.from_pretrained("hu-lab/PlantGFM")
tokenizer = PreTrainedTokenizerFast.from_pretrained("hu-lab/PlantGFM")
model = PlantGFMForCausalLM.from_pretrained("hu-lab/PlantGFM", config=config)
model.eval()

sequences = ["CCCTAAACCCTAAACCCTAAA", "ATGGCGTGGCTG"]

# The model works at single-nucleotide resolution, so insert a space
# between bases to tokenize each sequence base by base.
single_nucleotide_sequences = [" ".join(seq) for seq in sequences]

# Pad the batch to the length of its longest sequence.
tokenized_sequences = tokenizer(single_nucleotide_sequences, padding="longest")["input_ids"]
input_ids = torch.LongTensor(tokenized_sequences)

# hidden_states[0] is the first entry of the returned hidden states
# (the embedding-layer output under the transformers convention);
# shape: (batch_size, sequence_length, hidden_size).
with torch.no_grad():
    embd = model(input_ids=input_ids, output_hidden_states=True)["hidden_states"][0]
print(embd)
```
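
The hidden states above are per-nucleotide. To get one fixed-size vector per sequence, a common follow-up is to mean-pool over non-padding positions; a minimal sketch continuing from the snippet above (the pooling strategy is our illustration, not something the card prescribes):

```python
# Mean-pool the per-nucleotide embeddings into one vector per sequence,
# masking out padding positions (continues from the snippet above).
mask = (input_ids != tokenizer.pad_token_id).unsqueeze(-1).to(embd.dtype)
pooled = (embd * mask).sum(dim=1) / mask.sum(dim=1)  # (batch_size, hidden_size)
print(pooled.shape)
```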
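
Since PlantGFM is trained as a causal language model, it can in principle also generate sequence continuations. A minimal sketch, assuming the model exposes the standard transformers generate() API (this usage is our assumption and is not shown in the original card):

```python
# Hypothetical generation sketch: prompt with a short fragment and sample
# a continuation. Assumes PlantGFMForCausalLM supports GenerationMixin.
prompt = " ".join("ATGGCGTGGCTG")  # space-separated single-nucleotide tokens
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(inputs["input_ids"], max_new_tokens=50, do_sample=True, top_k=4)
print(tokenizer.decode(out[0]))
```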

#### Hardware

The model was trained for 468 hours on 8 NVIDIA A800 80GB GPUs.