---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# segment-borzoi

SegmentBorzoi is a segmentation model that leverages [Borzoi](https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1.abstract) to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes, including gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory elements (polyA signal, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).

**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### How to use

To Be Done

## Training data

The **SegmentBorzoi** model was trained on all human chromosomes except chromosomes 20 and 21, which were held out as the test set, and chromosome 22, which was used as the validation set. During training, sequences are randomly sampled from the genome together with their associated annotations. The validation and test sequences, however, are kept fixed by using a sliding window of length 524 kb (the original Borzoi input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.

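
The split and the fixed evaluation windows can be sketched as follows. This is a minimal illustration, not the training code: the `chr20`-style chromosome names, the helper names `split_of` and `sliding_windows`, the exact window size of 524,288 bp (2^19, assumed to be what "524 kb" refers to), and the non-overlapping default stride are all assumptions that the text above does not specify.

```python
from typing import Iterator, Optional, Tuple

# Held-out chromosomes as described in the text (naming style is assumed).
TEST_CHROMS = {"chr20", "chr21"}
VALID_CHROMS = {"chr22"}

# Assumed exact value behind "524 kb": 2**19 base pairs.
WINDOW = 524_288


def split_of(chrom: str) -> str:
    """Return the data split a chromosome belongs to."""
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in VALID_CHROMS:
        return "validation"
    return "train"


def sliding_windows(
    chrom_length: int, window: int = WINDOW, stride: Optional[int] = None
) -> Iterator[Tuple[int, int]]:
    """Yield fixed (start, end) windows over a chromosome.

    The stride defaults to the window length (non-overlapping windows),
    which is an assumption; a trailing partial window is dropped.
    """
    stride = window if stride is None else stride
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window
```

Because the windows depend only on the chromosome length, the window size, and the stride, the same evaluation sequences are produced on every run, in contrast to the random sampling used for training.
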
## Training procedure

### Preprocessing

The DNA sequences are tokenized using one-hot encoding, similar to the Enformer model.

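
As an illustration of one-hot tokenization, here is a minimal pure-Python sketch. The channel order A, C, G, T and the all-zero row for any other character (such as `N`) follow a convention commonly used with Enformer-style models; both are assumptions here, and the function name `one_hot_encode` is hypothetical rather than part of this model's API.

```python
from typing import List

# Assumed channel order; not confirmed for this model.
ALPHABET = "ACGT"


def one_hot_encode(sequence: str) -> List[List[float]]:
    """Encode a DNA string as a (length x 4) one-hot matrix.

    Characters outside ALPHABET (e.g. "N") map to an all-zero row.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded
```

For example, `one_hot_encode("ACN")` yields one row per nucleotide, with a single 1.0 in the A and C rows and only zeros in the `N` row.
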
### Architecture

The model is composed of the Borzoi backbone, whose original heads are removed and replaced by a 1-dimensional U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels, respectively.

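
To make the shape of such a head concrete, the sketch below only tracks (channels, length) through two downsampling and two upsampling blocks; it does not implement the convolutions. It reads "1,024 and 2,048 kernels" as the filter counts of the first and second downsampling blocks, and it assumes that the upsampling blocks mirror them and that each block changes the sequence length by a factor of 2. None of these details are stated above, and `unet_head_shapes`, `in_channels`, and `factor` are hypothetical names.

```python
from typing import List, Tuple

# Filter counts of the two downsampling blocks (interpretation of the text).
DOWN_CHANNELS = [1024, 2048]


def unet_head_shapes(
    in_channels: int, length: int, factor: int = 2
) -> List[Tuple[str, int, int]]:
    """Trace (block name, channels, length) through a 2-down / 2-up 1D U-Net.

    Assumes each downsampling block divides the length by `factor` and each
    upsampling block multiplies it by `factor`, with mirrored channel counts.
    """
    shapes = [("input", in_channels, length)]
    for i, channels in enumerate(DOWN_CHANNELS, start=1):
        length //= factor
        shapes.append((f"down{i}", channels, length))
    for i, channels in enumerate(reversed(DOWN_CHANNELS), start=1):
        length *= factor
        shapes.append((f"up{i}", channels, length))
    return shapes
```

Under these assumptions the head returns to the input length after the second upsampling block, which is what allows a prediction at every input position of the head.
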
### BibTeX entry and citation info

```bibtex
@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```