---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# SegmentBorzoi

SegmentBorzoi is a segmentation model leveraging [Borzoi](https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1.abstract) to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes, including gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory elements (polyA signal, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).

**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### How to use

Until its next release, the transformers library needs to be installed from source with the following command in order to use the model. PyTorch, einops and borzoi_pytorch should also be installed.

```
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch einops borzoi_pytorch==0.4.0
```

A small snippet of code is given here to retrieve logits from dummy DNA sequences.
```
import torch
from transformers import AutoModel

# trust_remote_code=True lets transformers run the model code shipped with the checkpoint
model = AutoModel.from_pretrained("InstaDeepAI/segment_borzoi", trust_remote_code=True)

def encode_sequences(sequences):
    """One-hot encode a DNA string, or a list of equal-length DNA strings."""
    # Channel order is A, C, G, T; 'N' maps to an all-zero vector
    one_hot_map = {
        'a': torch.tensor([1., 0., 0., 0.]),
        'c': torch.tensor([0., 1., 0., 0.]),
        'g': torch.tensor([0., 0., 1., 0.]),
        't': torch.tensor([0., 0., 0., 1.]),
        'n': torch.tensor([0., 0., 0., 0.]),
        'A': torch.tensor([1., 0., 0., 0.]),
        'C': torch.tensor([0., 1., 0., 0.]),
        'G': torch.tensor([0., 0., 1., 0.]),
        'T': torch.tensor([0., 0., 0., 1.]),
        'N': torch.tensor([0., 0., 0., 0.])
    }

    def encode_sequence(seq_str):
        one_hot_list = []
        for char in seq_str:
            # Any other character falls back to a uniform vector
            one_hot_vector = one_hot_map.get(char, torch.tensor([0.25, 0.25, 0.25, 0.25]))
            one_hot_list.append(one_hot_vector)
        return torch.stack(one_hot_list)

    if isinstance(sequences, list):
        # Stack per-sequence encodings into a single batch
        return torch.stack([encode_sequence(seq) for seq in sequences])
    else:
        return encode_sequence(sequences)

# Two dummy sequences of 524,288 nucleotides each
sequences = ["A"*524_288, "G"*524_288]
one_hot_encoding = encode_sequences(sequences)

# Inference only, so gradients are not needed
with torch.no_grad():
    preds = model(one_hot_encoding)
print(preds['logits'])
```

## Training data

The **SegmentBorzoi** model was trained on all human chromosomes except for chromosomes 20 and 21, kept as the test set, and chromosome 22, used as the validation set. During training, sequences are randomly sampled from the genome together with their associated annotations. However, we keep the sequences in the validation and test sets fixed by using a sliding window of length 524 kb (the original Borzoi input length) over the chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.

## Training procedure

### Preprocessing

The DNA sequences are tokenized using one-hot encoding, similar to the Enformer model.

### Architecture

The model is composed of the Borzoi backbone, whose heads are removed and replaced by a 1-dimensional U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks.
Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively.

### BibTeX entry and citation info

```bibtex
@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```
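
### Example: fixed evaluation windows

The Training data section describes fixed validation and test sequences obtained with a sliding window of the Borzoi input length. The following is a minimal sketch of how such windows could be enumerated, not the authors' actual pipeline. The function name `sliding_windows`, the default non-overlapping stride, and the choice to drop a trailing partial window are all assumptions made for illustration; the card does not state them.

```python
from typing import Iterator, Optional, Tuple

# Borzoi input length in nucleotides, as used for the dummy sequences above
WINDOW = 524_288


def sliding_windows(
    chrom_length: int,
    window: int = WINDOW,
    stride: Optional[int] = None,
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates of fixed-length windows.

    Assumptions (not stated in the model card):
    - the stride defaults to the window length, i.e. windows do not overlap;
    - a trailing window shorter than `window` is dropped.
    """
    if stride is None:
        stride = window
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # The last valid start is chrom_length - window
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


if __name__ == "__main__":
    # Hypothetical chromosome length, chosen only for illustration
    for start, end in sliding_windows(2_000_000):
        print(start, end)
```

Because the coordinates depend only on the chromosome length, the window size and the stride, every evaluation run sees the same sequences, which is what makes validation metrics comparable across training steps.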