---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# SegmentBorzoi

SegmentBorzoi is a segmentation model leveraging [Borzoi](https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1.abstract) to predict the location of several types of genomic 
elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes, including gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory elements (polyA signal, tissue-invariant and 
tissue-specific promoters and enhancers, and CTCF-bound sites).


**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### How to use

Until its next release, the `transformers` library must be installed from source with the following command in order to use the model.
PyTorch, einops and borzoi_pytorch must also be installed.

```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch einops borzoi_pytorch==0.4.0
```

The short snippet below retrieves the logits for two dummy DNA sequences.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("InstaDeepAI/segment_borzoi", trust_remote_code=True)

def encode_sequences(sequences):
    """One-hot encode a DNA string, or a list of equal-length DNA strings."""
    # Channel order is A, C, G, T; N maps to an all-zero vector.
    one_hot_map = {
        'A': torch.tensor([1., 0., 0., 0.]),
        'C': torch.tensor([0., 1., 0., 0.]),
        'G': torch.tensor([0., 0., 1., 0.]),
        'T': torch.tensor([0., 0., 0., 1.]),
        'N': torch.tensor([0., 0., 0., 0.]),
    }
    # Any other character gets a uniform vector over the four bases.
    unknown = torch.tensor([0.25, 0.25, 0.25, 0.25])

    def encode_sequence(seq_str):
        # Upper-casing makes the lookup case-insensitive.
        return torch.stack([one_hot_map.get(char, unknown) for char in seq_str.upper()])

    if isinstance(sequences, list):
        # Stacking requires all sequences to have the same length.
        return torch.stack([encode_sequence(seq) for seq in sequences])
    else:
        return encode_sequence(sequences)

# Two dummy sequences of 524,288 nucleotides each.
sequences = ["A"*524_288, "G"*524_288]
one_hot_encoding = encode_sequences(sequences)

# Inference only, so gradients are not needed.
with torch.no_grad():
    preds = model(one_hot_encoding)
print(preds['logits'])
```


## Training data

The **SegmentBorzoi** model was trained on all human chromosomes except chromosomes 20 and 21, which were kept as the test set, and chromosome 22, which was used as the validation set.
During training, sequences are randomly sampled from the genome together with their annotations. The sequences in the validation and test sets, however, are kept fixed by 
sliding a window of length 524 kb (the original Borzoi input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
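
The fixed evaluation windows described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `sliding_windows`, the stride (set equal to the window length, i.e. non-overlapping windows), the handling of the trailing remainder, and the toy chromosome length are all assumptions. Only the window length of 524,288 bp comes from this card.

```python
from typing import Iterator, Tuple

WINDOW = 524_288  # Borzoi input length in base pairs, as used in this card


def sliding_windows(
    chrom_length: int, window: int = WINDOW, stride: int = WINDOW
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) intervals of fixed width over a chromosome.

    Assumptions: the stride defaults to the window length (no overlap), and a
    trailing remainder shorter than `window` is dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


# Toy chromosome length (hypothetical, not a real assembly size).
windows = list(sliding_windows(chrom_length=2_000_000))
print(len(windows), windows[0], windows[-1])
```

Because the intervals depend only on the chromosome length, window and stride, the same evaluation sequences are produced on every run, in contrast to random sampling at training time.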

## Training procedure

### Preprocessing

The DNA sequences are encoded with a one-hot scheme similar to the one used by the Enformer model.
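
For long inputs, a lookup table avoids building one tensor per nucleotide. The sketch below uses NumPy and mirrors the mapping of the usage snippet in this card (A, C, G, T as unit vectors, N as all zeros, any other character as a uniform 0.25 vector). The names `LOOKUP` and `one_hot` are illustrative and are not part of the model's code.

```python
import numpy as np

# Table indexed by ASCII code; every row defaults to the uniform vector used
# for characters other than A, C, G, T and N.
LOOKUP = np.full((256, 4), 0.25, dtype=np.float32)
for index, base in enumerate("ACGT"):
    row = np.zeros(4, dtype=np.float32)
    row[index] = 1.0
    LOOKUP[ord(base)] = row          # upper case
    LOOKUP[ord(base.lower())] = row  # lower case
LOOKUP[ord("N")] = 0.0
LOOKUP[ord("n")] = 0.0


def one_hot(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) float32 array for an ASCII DNA string."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return LOOKUP[codes]


encoded = one_hot("ACGTNacgtR")
print(encoded.shape)
```

The resulting array can be converted with `torch.from_numpy` and given a leading batch dimension before being passed to the model.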

### Architecture

The model is composed of the Borzoi backbone, from which we remove the heads and replace them with a 1-dimensional U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these 
blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively.
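
To make the shape of such a head concrete, here is a rough PyTorch sketch of a 1-dimensional U-Net with two downsampling and two upsampling blocks of two convolutions each. It is not the released implementation: the class names, kernel size, activation, pooling and upsampling operators, skip connections by concatenation, and the final 1x1 projection to `num_classes` channels are all assumptions. Only the block counts and the widths of 1,024 and 2,048 come from the description above.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Two 1-D convolutions, each followed by an activation (assumed SiLU)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class UNet1DHead(nn.Module):
    """Hypothetical U-Net head; input is (batch, in_ch, length), length % 4 == 0."""

    def __init__(self, in_ch: int, num_classes: int = 14, widths=(1024, 2048)):
        super().__init__()
        w1, w2 = widths
        self.down1 = ConvBlock(in_ch, w1)
        self.down2 = ConvBlock(w1, w2)
        self.pool = nn.AvgPool1d(kernel_size=2)              # halves the length
        self.up = nn.Upsample(scale_factor=2, mode="nearest")  # doubles it
        self.up1 = ConvBlock(w2 + w2, w1)  # upsampled features + skip from down2
        self.up2 = ConvBlock(w1 + w1, w1)  # upsampled features + skip from down1
        self.out = nn.Conv1d(w1, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1 = self.down1(x)               # (B, w1, L)
        s2 = self.down2(self.pool(s1))   # (B, w2, L/2)
        h = self.up(self.pool(s2))       # (B, w2, L/4) then back to (B, w2, L/2)
        h = self.up1(torch.cat([h, s2], dim=1))           # (B, w1, L/2)
        h = self.up2(torch.cat([self.up(h), s1], dim=1))  # (B, w1, L)
        return self.out(h)               # (B, num_classes, L)


# Small widths keep this demo light; the card describes 1,024 and 2,048.
head = UNet1DHead(in_ch=8, num_classes=14, widths=(16, 32))
logits = head(torch.randn(2, 8, 64))
print(logits.shape)
```

The output keeps the input length, which is what allows one prediction per position of the feature map fed to the head.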

### BibTeX entry and citation info

```bibtex
@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

```