|
--- |
|
tags: |
|
- biology |
|
--- |
|
# Model Card for PlantGFM-Gene-generation
|
PlantGFM-Gene-generation is a gene-generation model re-trained from PlantGFM on the DNA sequences of 355,190 natural plant genes, each at most 4,000 base pairs long. Re-training used prompt-based training for two epochs, with the prompt "gene" guiding the model to generate novel plant gene sequences that follow the patterns and structures of natural genes.
|
|
|
### Model Sources |
|
|
|
- **Repository:** [PlantGFM](https://github.com/hu-lab-PlantGLM/PlantGLM) |
|
- **Manuscript:** [A Genetic Foundation Model for Discovery and Creation of Plant Genes]() |
|
|
|
**Developed by:** hu-lab |
|
|
|
# How to use the model |
|
|
|
Install the runtime library first:

```bash
pip install transformers
```
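The usage example below runs the model in bfloat16, which requires hardware support (e.g. NVIDIA Ampere GPUs or newer). As a minimal pre-flight sketch, you can check support before loading the model; `torch.cuda.is_bf16_supported` is a standard PyTorch call, and the fallback advice is only a suggestion:

```python
import torch

# Check that the current device can run bfloat16 before loading the model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda" and not torch.cuda.is_bf16_supported():
    print("This GPU lacks native bfloat16 support; consider float16/float32 instead.")
```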
|
To generate a new gene sequence with the model (the `plantgfm` modules imported below are PlantGFM's custom model code; see the repository linked above):
|
```python
import torch
from transformers import PreTrainedTokenizerFast
from torch.cuda.amp import autocast
from plantgfm.configuration_plantgfm import PlantGFMConfig
from plantgfm.modeling_plantgfm import PlantGFMForCausalLM

# Load the tokenizer, config, and model, and cast the model to bfloat16.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = PlantGFMConfig.from_pretrained("hu-lab/PlantGFM-Gene-generation")
tokenizer = PreTrainedTokenizerFast.from_pretrained("hu-lab/PlantGFM-Gene-generation")
model = PlantGFMForCausalLM.from_pretrained("hu-lab/PlantGFM-Gene-generation", config=config).to(device)
model = model.to(dtype=torch.bfloat16)

num_texts = 1    # total number of sequences to generate
batch_size = 1   # sequences per generation batch
generated_texts = []

# Generation starts from an empty prompt, replicated across the batch.
input_ids = tokenizer.encode("", return_tensors="pt").to(device, dtype=torch.long)
input_ids = input_ids.expand(batch_size, -1)

for i in range(0, num_texts, batch_size):
    with autocast(dtype=torch.bfloat16):
        output = model.generate(
            input_ids=input_ids,
            max_length=4000,  # matches the 4,000 bp cap of the training genes
            do_sample=True,
        )
    for output_sequence in output:
        generated_text = tokenizer.decode(output_sequence, skip_special_tokens=True)
        generated_texts.append(generated_text)
        print(generated_text)
```
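
The decoded outputs are plain nucleotide strings. As a minimal post-processing sketch (the whitespace stripping, record naming, and file name here are illustrative assumptions, not part of the official pipeline), you could write the generated sequences to a FASTA file:

```python
# Illustrative helper: write generated sequences to a FASTA file.
# Assumes `generated_texts` from the example above; the record names and
# 60-column line wrapping are conventions, not requirements of the model.
def save_as_fasta(sequences, path="generated_genes.fasta"):
    with open(path, "w") as f:
        for idx, seq in enumerate(sequences):
            cleaned = "".join(seq.split()).upper()  # drop any whitespace
            f.write(f">generated_gene_{idx}\n")
            for j in range(0, len(cleaned), 60):
                f.write(cleaned[j:j + 60] + "\n")

save_as_fasta(generated_texts)
```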
|
|
|
|
|
|
|
#### Hardware |
|
The model was trained for 15 hours on two NVIDIA A100-40G GPUs.