bernardo-de-almeida committed on
Commit 27942e1 · verified · 1 Parent(s): d3654a6

Update README.md

  1. README.md +38 -1
README.md CHANGED
@@ -11,4 +11,41 @@ elements in a sequence at a single nucleotide resolution. It was trained on 14 d
11
  tissue-specific promoters and enhancers, and CTCF-bound sites) elements.
12
 
13
 
14
- **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
+ **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
15
+
16
+ ### How to use
17
+
18
+ To Be Done
19
+
20
+
21
+
22
+ ## Training data
23
+
24
+ The **SegmentBorzoi** model was trained on all human chromosomes except chromosomes 20 and 21, which were held out as the test set, and chromosome 22, which was used as the validation set.
25
+ During training, sequences are randomly sampled from the genome together with their annotations. However, we keep the sequences in the validation and test sets fixed by
26
+ using a sliding window of length 524 kb (the original Borzoi input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
27
+
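The fixed evaluation windows described in the training-data paragraph can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `sliding_windows`, the non-overlapping stride, the exact window length of 524,288 bp (the text only says 524 kb), the toy chromosome length, and the choice to drop a trailing partial window are all assumptions.

```python
from typing import Iterator, Tuple

# Assumed exact value for the "524 kb" Borzoi input length (2**19 bp).
WINDOW = 524_288


def sliding_windows(
    chrom_length: int, window: int = WINDOW, stride: int = WINDOW
) -> Iterator[Tuple[int, int]]:
    """Yield fixed (start, end) intervals, end exclusive, along one chromosome.

    A trailing stretch shorter than `window` is dropped (an assumption).
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    start = 0
    while start + window <= chrom_length:
        yield (start, start + window)
        start += stride


# Toy length for illustration only, not a real chromosome size.
windows = list(sliding_windows(chrom_length=2_000_000))
for start, end in windows:
    print(start, end)
```

Because the windows depend only on the chromosome length, the window size, and the stride, the same intervals are produced on every run, which is what keeps an evaluation set fixed while training sequences are sampled at random.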
28
+ ## Training procedure
29
+
30
+ ### Preprocessing
31
+
32
+ The DNA sequences are tokenized with a one-hot encoding similar to that of the Enformer model.
33
+
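A one-hot encoding of DNA can be sketched in a few lines. This is an illustrative example only, not the tokenizer shipped with the model: the helper name `one_hot_encode`, the A, C, G, T channel order, and the all-zero row for any other character (such as N) are assumptions.

```python
from typing import List

ALPHABET = "ACGT"  # assumed channel order


def one_hot_encode(sequence: str) -> List[List[float]]:
    """Map each nucleotide to a 4-element row with a single 1.0.

    Characters outside ALPHABET (for example N) become an all-zero row,
    which is an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded


encoded = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

The result has one row per nucleotide and one column per base, so a sequence of length L becomes an L x 4 matrix.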
34
+ ### Architecture
35
+
36
+ The model is composed of the Borzoi backbone, from which we removed the heads and replaced them with a 1-dimensional U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these
37
+ blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels, respectively.
38
+
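As a rough aid for reading the architecture description, the sketch below only tracks tensor shapes through a head with 2 downsampling and 2 upsampling blocks. It is not the SegmentBorzoi implementation: the function name `unet_head_shapes`, the factor-2 resampling per block, the mirrored channel widths on the upsampling path, and the toy input length are all assumptions, and the convolutions and skip connections themselves are omitted.

```python
from typing import List, Sequence, Tuple


def unet_head_shapes(
    length: int, down_channels: Sequence[int] = (1024, 2048)
) -> List[Tuple[str, int, int]]:
    """Return (stage, sequence_length, channels) after each block.

    Assumes each downsampling block halves the sequence length and each
    upsampling block doubles it, with channel widths mirrored on the way up.
    """
    factor = 2 ** len(down_channels)
    if length <= 0 or length % factor != 0:
        raise ValueError(f"length must be a positive multiple of {factor}")
    shapes = []
    for i, channels in enumerate(down_channels, start=1):
        length //= 2  # assumed 2x downsampling
        shapes.append((f"down{i}", length, channels))
    for i, channels in enumerate(reversed(down_channels), start=1):
        length *= 2  # assumed 2x upsampling
        shapes.append((f"up{i}", length, channels))
    return shapes


# Toy input length for illustration only.
shapes = unet_head_shapes(length=4096)
for stage, seq_len, channels in shapes:
    print(f"{stage}: length={seq_len}, channels={channels}")
```

Under these assumptions the head returns to its input length after the second upsampling block, which is the property a segmentation head needs to emit one prediction per input position.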
39
+ ### BibTeX entry and citation info
40
+
41
+ ```bibtex
42
+ @article{de2024segmentnt,
43
+ title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
44
+ author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
45
+ journal={bioRxiv},
46
+ pages={2024--03},
47
+ year={2024},
48
+ publisher={Cold Spring Harbor Laboratory}
49
+ }
50
+
51
+ ```