---
license: apache-2.0
---
## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Built on the Caduceus architecture with a masked language modeling objective, it was pre-trained on genomic sequences from 16 species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models with varying parameter sizes:

- **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
- **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters
- **PlantCaduceus_l28**: 28 layers, 768 hidden size, 112M parameters
- **PlantCaduceus_l32**: 32 layers, 1024 hidden size, 225M parameters

## How to use

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'maize-genetics/PlantCaduceus_l20'

# Load the model and tokenizer; trust_remote_code is required for the
# custom Caduceus architecture
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize an example DNA sequence
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)

# Forward pass; hidden states are returned for downstream use
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
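
The hidden states returned above (`outputs.hidden_states[-1]`, shape `(batch, seq_len, hidden_size)`) can be pooled into a fixed-length embedding for downstream tasks. A minimal sketch of mean-pooling, using a random tensor as a stand-in for the model output so it runs without the checkpoint (the hidden size of 384 matches PlantCaduceus_l20; the pooling choice here is an illustrative assumption, not part of the released pipeline):

```python
import torch

# Stand-in for outputs.hidden_states[-1]: (batch, seq_len, hidden_size)
hidden = torch.randn(1, 16, 384)

# Mean-pool over the sequence dimension to get one vector per sequence
embedding = hidden.mean(dim=1)  # shape: (1, 384)
```

Other pooling strategies (e.g. taking the embedding at a specific position of interest, such as a candidate variant site) may be more appropriate depending on the task.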