	---
	tags:
	- model_hub_mixin
	- pytorch_model_hub_mixin
	---

	# SegmentEnformer

	SegmentEnformer is a segmentation model that leverages [Enformer](https://www.nature.com/articles/s41592-021-01252-x) to predict the location of several types of genomic
	elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes, including gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory elements (polyA signal, tissue-invariant and
	tissue-specific promoters and enhancers, and CTCF-bound sites).


	Developed by: [InstaDeep](https://huggingface.co/InstaDeepAI)

	### Model Sources

	- Repository: [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
	- Paper: [Segmenting the genome at single-nucleotide resolution with DNA foundation models](https://www.biorxiv.org/content/biorxiv/early/2024/03/15/2024.03.14.584712.full.pdf)

	### How to use

	Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the model.
	PyTorch, `einops` and `enformer_pytorch` should also be installed.

	```
	pip install --upgrade git+https://github.com/huggingface/transformers.git
	pip install torch einops enformer_pytorch==0.7.6
	```

	A small snippet of code is given here in order to retrieve the logits from dummy DNA sequences.

	```
	import torch
	from transformers import AutoModel

	model = AutoModel.from_pretrained("InstaDeepAI/segment_enformer", trust_remote_code=True)

	def encode_sequences(sequences):
	    # Map each nucleotide (either case) to a one-hot vector; N maps to all zeros.
	    one_hot_map = {
	        'a': torch.tensor([1., 0., 0., 0.]),
	        'c': torch.tensor([0., 1., 0., 0.]),
	        'g': torch.tensor([0., 0., 1., 0.]),
	        't': torch.tensor([0., 0., 0., 1.]),
	        'n': torch.tensor([0., 0., 0., 0.]),
	        'A': torch.tensor([1., 0., 0., 0.]),
	        'C': torch.tensor([0., 1., 0., 0.]),
	        'G': torch.tensor([0., 0., 1., 0.]),
	        'T': torch.tensor([0., 0., 0., 1.]),
	        'N': torch.tensor([0., 0., 0., 0.])
	    }

	    def encode_sequence(seq_str):
	        one_hot_list = []
	        for char in seq_str:
	            # Any character outside the map gets a uniform 0.25 vector.
	            one_hot_vector = one_hot_map.get(char, torch.tensor([0.25, 0.25, 0.25, 0.25]))
	            one_hot_list.append(one_hot_vector)
	        return torch.stack(one_hot_list)

	    # A list of sequences is stacked into a batch; a single string is encoded alone.
	    if isinstance(sequences, list):
	        return torch.stack([encode_sequence(seq) for seq in sequences])
	    else:
	        return encode_sequence(sequences)

	# Two dummy sequences of 196,608 nucleotides each.
	sequences = ["A" * 196608, "G" * 196608]
	one_hot_encoding = encode_sequences(sequences)
	preds = model(one_hot_encoding)
	print(preds['logits'])
	```

	## Training data

	The SegmentEnformer model was trained on all human chromosomes except chromosomes 20 and 21, which were held out as the test set, and chromosome 22, which was used as the validation set.
	During training, sequences are randomly sampled from the genome together with their annotations. The sequences in the validation and test sets, in contrast, are kept fixed by
	sliding a window of length 196 kb (the original Enformer input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
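
The chromosome split and the fixed evaluation windows described above can be sketched as follows. This is a minimal illustration, not the training pipeline itself: the chromosome names (`chr20`, ...), the helper names `assign_split` and `fixed_windows`, the non-overlapping stride, and the choice to drop a trailing partial window are all assumptions. The window length of 196,608 nucleotides is the sequence length used in the usage example.

```python
WINDOW_LENGTH = 196_608  # sequence length from the usage example (about 196 kb)

# Held-out chromosomes as described in the text; the "chr" naming is an assumption.
TEST_CHROMOSOMES = {"chr20", "chr21"}
VALIDATION_CHROMOSOMES = {"chr22"}


def assign_split(chromosome):
    """Return the data split ("train", "validation" or "test") for a chromosome."""
    if chromosome in TEST_CHROMOSOMES:
        return "test"
    if chromosome in VALIDATION_CHROMOSOMES:
        return "validation"
    return "train"


def fixed_windows(chromosome_length, window_length=WINDOW_LENGTH, stride=None):
    """Return half-open (start, end) intervals from a sliding window.

    Assumption: the stride defaults to the window length (no overlap) and a
    trailing window shorter than `window_length` is dropped.
    """
    stride = window_length if stride is None else stride
    last_start = chromosome_length - window_length
    return [
        (start, start + window_length)
        for start in range(0, last_start + 1, stride)
    ]


if __name__ == "__main__":
    print(assign_split("chr1"), assign_split("chr21"), assign_split("chr22"))
    print(fixed_windows(500_000))
```

Because the windows depend only on the chromosome length and the stride, evaluating on them gives the same set of sequences at every epoch, unlike the random sampling used for training.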

	## Training procedure

	### Preprocessing

	The DNA sequences are one-hot encoded, as in the Enformer model.
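
For long inputs, the one-hot encoding can also be done with a lookup table instead of a per-character loop. The sketch below is one possible way to do this and is not part of the released code: the names `build_lookup_table`, `LOOKUP` and `one_hot_encode` are hypothetical. It follows the same convention as the usage example: A, C, G, T (either case) become one-hot vectors, N becomes all zeros, and any other character becomes a uniform 0.25 vector.

```python
import torch


def build_lookup_table():
    """Return a (256, 4) table mapping ASCII byte values to channel vectors."""
    # Characters outside the alphabet default to a uniform vector.
    table = torch.full((256, 4), 0.25)
    for index, base in enumerate("ACGT"):
        one_hot = torch.zeros(4)
        one_hot[index] = 1.0
        table[ord(base)] = one_hot
        table[ord(base.lower())] = one_hot
    # N (unknown nucleotide) is encoded as all zeros.
    table[ord("N")] = 0.0
    table[ord("n")] = 0.0
    return table


LOOKUP = build_lookup_table()


def one_hot_encode(sequence):
    """Encode an ASCII DNA string as a (length, 4) float tensor."""
    byte_values = torch.tensor(list(sequence.encode("ascii")), dtype=torch.long)
    # Indexing the table with the byte values encodes all positions at once.
    return LOOKUP[byte_values]


if __name__ == "__main__":
    print(one_hot_encode("ACGTN"))
```

A batch can then be built with `torch.stack([one_hot_encode(s) for s in sequences])`, provided all sequences have the same length.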

	### Architecture

	The model is composed of the Enformer backbone, from which we removed the original heads and replaced them with a 1-dimensional U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these
	blocks is made of 2 convolutional layers, with 1,024 and 2,048 kernels respectively.
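
To make the shape of such a head concrete, here is a rough PyTorch sketch of a 1-dimensional U-Net with 2 downsampling and 2 upsampling blocks of 2 convolutions each. It is one interpretation of the description above, not the released implementation: the class names `ConvBlock` and `UNetHead1D`, the kernel size, the SiLU activation, average pooling, nearest-neighbour upsampling, additive skip connections, and the single logit per class and position are all assumptions, and `embed_dim` stands in for the backbone's feature dimension, which is not stated here.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Two 1D convolutions, each followed by an activation (assumed: SiLU)."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keeps the sequence length unchanged
        self.layers = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding),
            nn.SiLU(),
            nn.Conv1d(out_channels, out_channels, kernel_size, padding=padding),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.layers(x)


class UNetHead1D(nn.Module):
    """Hypothetical 1D U-Net head: 2 downsampling and 2 upsampling blocks."""

    def __init__(self, embed_dim, num_classes=14, channels=(1024, 2048)):
        super().__init__()
        first, second = channels
        self.down1 = ConvBlock(embed_dim, first)
        self.down2 = ConvBlock(first, second)
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.up1 = ConvBlock(second, second)
        self.up2 = ConvBlock(second, first)
        self.classifier = nn.Conv1d(first, num_classes, kernel_size=1)

    def forward(self, x):
        # x: (batch, embed_dim, length), with length divisible by 4.
        skip1 = self.down1(x)                     # (batch, first, length)
        skip2 = self.down2(self.pool(skip1))      # (batch, second, length / 2)
        bottom = self.pool(skip2)                 # (batch, second, length / 4)
        up1 = self.up1(self.upsample(bottom)) + skip2
        up2 = self.up2(self.upsample(up1)) + skip1
        return self.classifier(up2)               # (batch, num_classes, length)


if __name__ == "__main__":
    # Small sizes so the example runs quickly; the text gives 1,024 and 2,048.
    head = UNetHead1D(embed_dim=8, num_classes=14, channels=(16, 32))
    print(head(torch.randn(2, 8, 12)).shape)
```

The skip connections let the upsampling path reuse the higher-resolution features computed on the way down, which is the usual motivation for a U-Net in per-position prediction tasks.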

	### BibTeX entry and citation info

	```bibtex
	@article{de2024segmentnt,
	  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
	  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
	  journal={bioRxiv},
	  pages={2024--03},
	  year={2024},
	  publisher={Cold Spring Harbor Laboratory}
	}
	```