---
library_name: transformers
license: mit
pipeline_tag: sentence-similarity
---

# DNA2Vec: Transformer-Based DNA Sequence Embedding

This repository provides an implementation of `dna2vec`, a transformer-based model for DNA sequence embeddings. It includes both a Hugging Face model (`hf_model`) and a locally trained model (`local_model`). The embeddings can be used for DNA sequence alignment, classification, and other genomic applications.

## Model Overview

DNA sequence alignment is an essential genomic task that involves mapping short DNA reads to the most probable locations within a reference genome. Traditional methods rely on genome indexing and efficient search algorithms, while recent advances leverage transformer-based models to encode DNA sequences into vector representations. 

The `dna2vec` framework introduces a **Reference-Free DNA Embedding (RDE) Transformer model**, which encodes DNA sequences into a shared vector space, allowing for efficient similarity search and sequence alignment.

### Key Features:
- **Transformer-based architecture** trained on genomic data.
- **Reference-free embeddings** that enable efficient sequence retrieval.
- **Contrastive loss for self-supervised training**, ensuring robust sequence similarity learning.
- **Support for Hugging Face and custom-trained local models**.
- **Efficient search through a DNA vector store**, reducing genome-wide alignment to a local search.

## Model Details

### Model Architecture
The transformer model consists of (a minimal configuration sketch follows the list):
- **Attention heads:** 12
- **Encoder layers:** 6
- **Embedding dimension:** 1020
- **Vocabulary size:** 10,000
- **Sequence matching:** cosine similarity between pooled sequence embeddings
- **Dropout:** 0.1
- **Training:** cosine annealing learning rate scheduling

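The released weights load through `trust_remote_code` (see Usage below), so the exact implementation lives in the repository's remote code. Purely as an illustration of the hyperparameters listed above, here is a minimal PyTorch sketch; the class name `DNAEncoder`, the optimizer, the learning rate, and the scheduler horizon are hypothetical choices, not the released training configuration.

```python
import torch
import torch.nn as nn

class DNAEncoder(nn.Module):
    """Illustrative encoder with the hyperparameters listed above (not the released code)."""
    def __init__(self, vocab_size=10_000, d_model=1020, n_heads=12, n_layers=6, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dropout=dropout, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            self.embed(input_ids),
            src_key_padding_mask=(attention_mask == 0),  # True marks padding tokens
        )
        # Average-pool over non-padding positions; sequences are then compared
        # with cosine similarity between these pooled vectors.
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine annealing learning-rate schedule, as noted above
# (optimizer, learning rate, and T_max are illustrative).
model = DNAEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```
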
## Installation

To use the model, install the required dependencies:

```bash
pip install transformers torch
```

## Usage

### Load Hugging Face Model

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn

def load_hf_model():
    # Load the pretrained encoder and tokenizer from the Hugging Face Hub.
    hf_model = AutoModel.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)
    hf_tokenizer = AutoTokenizer.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)

    class AveragePooler(nn.Module):
        """Mean-pools token embeddings, ignoring padded positions."""
        def forward(self, last_hidden, attention_mask):
            return (last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)

    hf_model.pooler = AveragePooler()
    return hf_model, hf_tokenizer, hf_model.pooler
```
### Using the Model

Once the model is loaded, you can use it to obtain embeddings for DNA sequences:
```python
# Load the model once and reuse it across calls.
model, tokenizer, pooler = load_hf_model()
model.eval()

def get_embedding(dna_sequence):
    # Tokenize the sequence and run it through the encoder.
    tokenized_input = tokenizer(dna_sequence, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**tokenized_input)
    # Average-pool the token embeddings into a single fixed-size vector.
    embedding = pooler(output.last_hidden_state, tokenized_input.attention_mask)
    return embedding.numpy()

# Example usage
dna_seq = "ATGCGTACGTAGCTAGCTAGC"
embedding = get_embedding(dna_seq)
print("Embedding shape:", embedding.shape)
```
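
### Comparing Sequences with Cosine Similarity

Since `dna2vec` embeds reads and reference fragments into a shared vector space, alignment-style retrieval reduces to a similarity search over fragment embeddings. The snippet below is a toy sketch of that idea using plain cosine similarity and the `get_embedding` helper above; the fragment sequences are invented for the example, and a real pipeline would index fixed-length windows of a reference genome in a vector store.

```python
import torch
import torch.nn.functional as F

# Toy reference fragments (invented for illustration).
fragments = [
    "ATGCGTACGTAGCTAGCTAGCGGATCCAAGT",
    "TTGACCGGTAACCGTTAGGCATCGATCGGAT",
    "GGCATTACGATCGGATCCTTAGCAGCATGCA",
]
read = "ATGCGTACGTAGCTAGCTAGC"

# Embed the read and each fragment (get_embedding returns a NumPy array of shape (1, dim)).
read_emb = torch.from_numpy(get_embedding(read))
frag_embs = torch.cat([torch.from_numpy(get_embedding(f)) for f in fragments])

# Cosine similarity between the read and every fragment; the highest score
# marks the most likely fragment of origin.
scores = F.cosine_similarity(read_emb.expand_as(frag_embs), frag_embs)
print("Similarity scores:", scores.tolist())
print("Best-matching fragment index:", int(scores.argmax()))
```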

## Training Details

### Dataset
The training data consists of DNA sequences sampled from various chromosomes across species. The dataset covers **approximately 2% of the human genome**, which encourages generalization across different sequences. Reads are generated using **ART MiSeq** simulation, with variations in insertion and deletion rates.

### Training Procedure
- **Self-Supervised Learning:** Contrastive loss-based training.
- **Dynamic-Length Sequences:** DNA fragments of length 800-2000, with read lengths sampled from [150, 500].
- **Noise Augmentation:** 1-5% random base substitutions applied to 40% of training reads (a sketch follows this list).
- **Batch Size:** 16 with gradient accumulation.
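
As a rough illustration of the noise-augmentation step, the sketch below applies random base substitutions to a fraction of reads. The function name and sampling details are hypothetical; this is not the actual training code.

```python
import random

BASES = "ACGT"

def augment_read(read, p_augment=0.4, sub_rate_range=(0.01, 0.05)):
    """Illustrative augmentation: with probability p_augment, substitute
    1-5% of bases in the read with a different random base."""
    if random.random() > p_augment:
        return read
    sub_rate = random.uniform(*sub_rate_range)
    bases = list(read)
    n_subs = max(1, int(len(bases) * sub_rate))
    for i in random.sample(range(len(bases)), n_subs):
        bases[i] = random.choice([b for b in BASES if b != bases[i]])
    return "".join(bases)

print(augment_read("ATGCGTACGTAGCTAGCTAGCATCGGATCCAAGTTCGA"))
```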

## Evaluation

The model was evaluated against traditional aligners (Bowtie-2) and other transformer-based baselines (DNABERT-2, HyenaDNA). The evaluation metrics include:
- **Alignment Recall:** >99% for high-quality reads.
- **Cross-Species Transfer:** successfully aligns sequences from different species, including *Thermus aquaticus* and *Rattus norvegicus*.

## Citation

If you use this model, please cite:

```bibtex
@article{10.1093/bioinformatics/btaf041,
    author = {Holur, Pavan and Enevoldsen, K C and Rajesh, Shreyas and Mboning, Lajoyce and Georgiou, Thalia and Bouchard, Louis-S and Pellegrini, Matteo and Roychowdhury, Vwani},
    title = {Embed-Search-Align: DNA Sequence Alignment using Transformer models},
    journal = {Bioinformatics},
    pages = {btaf041},
    year = {2025},
    month = {02},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btaf041},
    url = {https://doi.org/10.1093/bioinformatics/btaf041},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaf041/61778456/btaf041.pdf},
}
```

For more details, check the [full paper](https://arxiv.org/abs/2309.11087v6).