---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# SegmentBorzoi

SegmentBorzoi is a segmentation model leveraging [Borzoi](https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1.abstract) to predict the location of several types of genomic
elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes, including gene (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory (polyA signal, tissue-invariant and
tissue-specific promoters and enhancers, and CTCF-bound sites) elements.


**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### How to use

Until its next release, the `transformers` library must be installed from source with the following command in order to use the model. PyTorch must also be installed, as it is used to one-hot encode the input sequences.

```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch
```

The short snippet below retrieves the logits for two dummy DNA sequences.

```python
import torch
from transformers import AutoModel

# trust_remote_code=True allows loading the custom model code hosted in the model repository
model = AutoModel.from_pretrained("InstaDeepAI/segment_borzoi", trust_remote_code=True)

def encode_sequences(sequences):
    # One-hot vectors in A, C, G, T channel order; N maps to an all-zero vector
    one_hot_map = {
        'a': torch.tensor([1., 0., 0., 0.]),
        'c': torch.tensor([0., 1., 0., 0.]),
        'g': torch.tensor([0., 0., 1., 0.]),
        't': torch.tensor([0., 0., 0., 1.]),
        'n': torch.tensor([0., 0., 0., 0.]),
        'A': torch.tensor([1., 0., 0., 0.]),
        'C': torch.tensor([0., 1., 0., 0.]),
        'G': torch.tensor([0., 0., 1., 0.]),
        'T': torch.tensor([0., 0., 0., 1.]),
        'N': torch.tensor([0., 0., 0., 0.])
    }

    def encode_sequence(seq_str):
        one_hot_list = []
        for char in seq_str:
            # Any other character falls back to a uniform vector over the four bases
            one_hot_vector = one_hot_map.get(char, torch.tensor([0.25, 0.25, 0.25, 0.25]))
            one_hot_list.append(one_hot_vector)
        return torch.stack(one_hot_list)  # shape: (sequence_length, 4)

    if isinstance(sequences, list):
        # Sequences must all have the same length to be stacked into one batch
        return torch.stack([encode_sequence(seq) for seq in sequences])  # shape: (batch_size, sequence_length, 4)
    else:
        return encode_sequence(sequences)

# Two dummy sequences of 524,288 nucleotides each
sequences = ["A"*524_288, "G"*524_288]
one_hot_encoding = encode_sequences(sequences)
preds = model(one_hot_encoding)
print(preds['logits'])
```
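The per-character loop in the snippet above can be slow for inputs of 524,288 nucleotides. As a minimal sketch, the same mapping can be expressed as a lookup table indexed by byte value. The name `encode_sequences_fast` and the use of NumPy are assumptions made for illustration and are not part of the model repository.

```python
import numpy as np

# Lookup table indexed by ASCII byte value. Every entry starts as the uniform
# 0.25 vector, which is the fallback used for unknown characters above.
LOOKUP = np.full((256, 4), 0.25, dtype=np.float32)
for base, channel in (("A", 0), ("C", 1), ("G", 2), ("T", 3)):
    for char in (base, base.lower()):
        LOOKUP[ord(char)] = 0.0
        LOOKUP[ord(char), channel] = 1.0
# N (either case) maps to an all-zero vector, as in the original mapping.
LOOKUP[ord("N")] = 0.0
LOOKUP[ord("n")] = 0.0


def encode_sequences_fast(sequences):
    """One-hot encode equal-length DNA strings into a (batch, length, 4) float32 array.

    Hypothetical helper: non-ASCII characters are replaced by '?' and therefore
    receive the uniform 0.25 vector, like any other unknown character.
    """
    codes = np.stack([
        np.frombuffer(seq.encode("ascii", errors="replace"), dtype=np.uint8)
        for seq in sequences
    ])
    # Integer-array indexing gathers one 4-vector per nucleotide.
    return LOOKUP[codes]
```

The resulting array can be converted with `torch.from_numpy` before being passed to the model.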


## Training data

The **SegmentBorzoi** model was trained on all human chromosomes except chromosomes 20 and 21, which were kept as the test set, and chromosome 22, which was used as the validation set.
During training, sequences were randomly sampled from the genome together with their annotations. However, the sequences in the validation and test sets were kept fixed by
using a sliding window of length 524kb (the original Borzoi input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
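As an illustration of the fixed evaluation windows described above, the sketch below tiles a chromosome with windows of 524,288 nucleotides. The stride is not stated here, so non-overlapping windows and a dropped trailing partial window are assumptions, and `sliding_windows` is a hypothetical helper name.

```python
WINDOW = 524_288  # input length in nucleotides, as in the usage snippet


def sliding_windows(chrom_length, window=WINDOW, stride=WINDOW):
    """Yield (start, end) coordinates of fixed windows along a chromosome.

    Assumptions for this sketch: the stride defaults to the window length
    (no overlap) and a trailing partial window is discarded.
    """
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


# Example with a made-up chromosome length of 2,000,000 nucleotides.
windows = list(sliding_windows(2_000_000))
print(len(windows), windows[0], windows[-1])
```

Because the coordinates depend only on the chromosome length, the same windows are produced on every run, which keeps evaluation metrics comparable across training steps.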

## Training procedure

### Preprocessing

The DNA sequences are tokenized using one-hot encoding, similar to the Enformer model.

### Architecture

The model is composed of the Borzoi backbone, from which we removed the heads and replaced them with a 1-dimensional U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these
blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively.
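To make the shape of such a head concrete, here is a minimal PyTorch sketch of a 1-dimensional U-Net with 2 downsampling and 2 upsampling blocks of 2 convolutions each. It is not the released implementation: the kernel size, activation, pooling and upsampling operators, skip connections, one-logit-per-class output, and the names `ConvBlock`, `UNetHead1D`, and `embed_dim` are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Two 1D convolutions, each followed by an activation (kernel size 3 is assumed)."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keeps the sequence length unchanged
        self.layers = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding),
            nn.SiLU(),
            nn.Conv1d(out_channels, out_channels, kernel_size, padding=padding),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.layers(x)


class UNetHead1D(nn.Module):
    """Hypothetical U-Net head: 2 downsampling blocks, 2 upsampling blocks."""

    def __init__(self, embed_dim, num_classes=14, widths=(1024, 2048)):
        super().__init__()
        w1, w2 = widths
        self.down1 = ConvBlock(embed_dim, w1)
        self.down2 = ConvBlock(w1, w2)
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        # Upsampling blocks take the upsampled features concatenated with a skip connection.
        self.up1 = ConvBlock(w2 + w2, w1)
        self.up2 = ConvBlock(w1 + w1, w1)
        self.classifier = nn.Conv1d(w1, num_classes, kernel_size=1)

    def forward(self, embeddings):
        # embeddings: (batch, length, embed_dim); length must be divisible by 4.
        x = embeddings.transpose(1, 2)                 # (batch, embed_dim, length)
        skip1 = self.down1(x)                          # (batch, w1, length)
        skip2 = self.down2(self.pool(skip1))           # (batch, w2, length / 2)
        bottom = self.pool(skip2)                      # (batch, w2, length / 4)
        x = self.up1(torch.cat([self.upsample(bottom), skip2], dim=1))  # (batch, w1, length / 2)
        x = self.up2(torch.cat([self.upsample(x), skip1], dim=1))       # (batch, w1, length)
        return self.classifier(x).transpose(1, 2)      # (batch, length, num_classes)
```

For a quick shape check, smaller widths such as `UNetHead1D(embed_dim=8, widths=(16, 32))` applied to a tensor of shape `(2, 16, 8)` return a tensor of shape `(2, 16, 14)`.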

### BibTeX entry and citation info

```bibtex
@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

```