---
library_name: transformers
license: mit
pipeline_tag: sentence-similarity
---
# DNA2Vec: Transformer-Based DNA Sequence Embedding
This repository provides an implementation of `dna2vec`, a transformer-based model designed for DNA sequence embeddings. It includes both the Hugging Face (`hf_model`) and a locally trained model (`local_model`). The model can be used for DNA sequence alignment, classification, and other genomic applications.
## Model Overview
DNA sequence alignment is an essential genomic task that involves mapping short DNA reads to the most probable locations within a reference genome. Traditional methods rely on genome indexing and efficient search algorithms, while recent advances leverage transformer-based models to encode DNA sequences into vector representations.
The `dna2vec` framework introduces a **Reference-Free DNA Embedding (RDE) Transformer model**, which encodes DNA sequences into a shared vector space, allowing for efficient similarity search and sequence alignment.
### Key Features:
- **Transformer-based architecture** trained on genomic data.
- **Reference-free embeddings** that enable efficient sequence retrieval.
- **Contrastive loss for self-supervised training**, ensuring robust sequence similarity learning.
- **Support for Hugging Face and custom-trained local models**.
- **Efficient search through a DNA vector store**, reducing genome-wide alignment to a local search (see the sketch after this list).
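As a rough illustration of that last point, the sketch below retrieves the genome fragments whose embeddings are closest to a read embedding by cosine similarity. The fragment count, array shapes, and helper name `cosine_top_k` are illustrative assumptions, not part of this repository's API; in practice the fragment embeddings would come from the model and live in an approximate nearest-neighbour index rather than a dense NumPy matrix.

```python
import numpy as np

def cosine_top_k(read_vec, fragment_matrix, k=5):
    """Return the indices and scores of the k fragments most similar to the read."""
    # Normalize so that a dot product equals cosine similarity.
    read = read_vec / np.linalg.norm(read_vec)
    frags = fragment_matrix / np.linalg.norm(fragment_matrix, axis=1, keepdims=True)
    scores = frags @ read
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Illustrative usage: 1,000 pre-computed fragment embeddings of dimension 1020.
fragment_embeddings = np.random.randn(1000, 1020).astype(np.float32)
read_embedding = np.random.randn(1020).astype(np.float32)
indices, sims = cosine_top_k(read_embedding, fragment_embeddings)
print(indices, sims)
```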
## Model Details
### Model Architecture
The transformer model consists of:
- **12 attention heads**
- **6 encoder layers**
- **Embedding dimension:** 1020
- **Vocabulary size:** 10,000
- **Cosine similarity-based sequence matching**
- **Dropout:** 0.1
- **Learning rate schedule:** cosine annealing (a configuration sketch follows this list)
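For orientation, the hyperparameters above map onto a plain PyTorch encoder roughly as follows. This is an illustrative stand-in, not the model's actual implementation; module choices and the feed-forward size are PyTorch defaults, not taken from the repository.

```python
import torch.nn as nn

# Illustrative stand-in for the encoder described above, using the listed hyperparameters.
embedding_dim, n_heads, n_layers, vocab_size, dropout = 1020, 12, 6, 10_000, 0.1

token_embedding = nn.Embedding(vocab_size, embedding_dim)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=embedding_dim,  # embedding dimension: 1020
    nhead=n_heads,          # 12 attention heads
    dropout=dropout,        # dropout: 0.1
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)  # 6 encoder layers
```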
## Installation
To use the model, install the required dependencies:
```bash
pip install transformers torch
```
## Usage
### Load Hugging Face Model
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn

def load_hf_model():
    hf_model = AutoModel.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)
    hf_tokenizer = AutoTokenizer.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)

    class AveragePooler(nn.Module):
        """Mean-pools token embeddings, ignoring padded positions."""
        def forward(self, last_hidden, attention_mask):
            mask = attention_mask.unsqueeze(-1)
            return (last_hidden * mask).sum(1) / attention_mask.sum(-1).unsqueeze(-1)

    hf_model.pooler = AveragePooler()
    return hf_model, hf_tokenizer, hf_model.pooler
```
### Using the Model
Once the model is loaded, you can use it to obtain embeddings for DNA sequences:
```python
def get_embedding(dna_sequence):
    # Note: loading the model on every call is for brevity only; cache it in practice.
    model, tokenizer, pooler = load_hf_model()
    tokenized_input = tokenizer(dna_sequence, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**tokenized_input)
        embedding = pooler(output.last_hidden_state, tokenized_input.attention_mask)
    return embedding.numpy()


# Example usage
dna_seq = "ATGCGTACGTAGCTAGCTAGC"
embedding = get_embedding(dna_seq)
print("Embedding shape:", embedding.shape)
```
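Because the model card uses the sentence-similarity pipeline tag, a natural next step is scoring two sequences against each other. The snippet below is a minimal sketch that reuses `get_embedding` from above with plain cosine similarity; the example sequences are arbitrary.

```python
import numpy as np

def similarity(seq_a, seq_b):
    """Cosine similarity between the pooled embeddings of two DNA sequences."""
    emb_a = get_embedding(seq_a)[0]
    emb_b = get_embedding(seq_b)[0]
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

print("Similarity:", similarity("ATGCGTACGTAGCTAGCTAGC", "ATGCGTACGTAGCTAGCTTGC"))
```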
## Training Details
### Dataset
The training data consists of DNA sequences sampled from various chromosomes across species. The dataset covers **approximately 2% of the human genome**, which encourages generalization across different sequences. Reads are generated using **ART MiSeq** simulation, with variations in insertion and deletion rates.
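As a purely illustrative stand-in for the ART MiSeq pipeline (not a substitute for it, and ignoring insertions and deletions), the read-sampling regime can be pictured as cutting reads out of reference fragments and injecting random base substitutions:

```python
import random

BASES = "ACGT"

def sample_noisy_read(fragment, read_len=150, sub_rate=0.03):
    """Toy sampler: cut a read from a reference fragment and add random base substitutions."""
    start = random.randint(0, len(fragment) - read_len)
    read = list(fragment[start:start + read_len])
    for i, base in enumerate(read):
        if random.random() < sub_rate:  # the card quotes a 1-5% substitution range; 3% here
            read[i] = random.choice([b for b in BASES if b != base])
    return "".join(read), start

reference_fragment = "".join(random.choice(BASES) for _ in range(2000))
read, offset = sample_noisy_read(reference_fragment)
print(read[:30], "sampled at offset", offset)
```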
### Training Procedure
- **Self-Supervised Learning:** Contrastive loss-based training (a minimal loss sketch follows this list).
- **Dynamic Length Sequences:** DNA fragments of length 800-2000 with reads sampled in [150, 500].
- **Noise Augmentation:** 1-5% random base substitutions in 40% of training reads.
- **Batch Size:** 16 with gradient accumulation.
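The exact loss is not reproduced in this card; the following is a minimal InfoNCE-style sketch under the assumption of cosine similarity with a temperature hyperparameter (the value 0.07 is an assumption), where each read in a batch is paired with its source fragment and all other fragments serve as negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(read_emb, fragment_emb, temperature=0.07):
    """InfoNCE-style loss: row i of read_emb and fragment_emb form a positive pair."""
    read_emb = F.normalize(read_emb, dim=-1)
    fragment_emb = F.normalize(fragment_emb, dim=-1)
    logits = read_emb @ fragment_emb.T / temperature  # pairwise cosine similarities
    targets = torch.arange(read_emb.size(0), device=read_emb.device)
    return F.cross_entropy(logits, targets)

# Illustrative usage with the batch size quoted above.
loss = contrastive_loss(torch.randn(16, 1020), torch.randn(16, 1020))
print(loss.item())
```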
## Evaluation
The model was evaluated against traditional aligners (Bowtie-2) and other transformer-based baselines (DNABERT-2, HyenaDNA). The evaluation metrics include:
- **Alignment Recall:** >99% for high-quality reads.
- **Cross-Species Transfer:** Successfully aligns sequences from different species, including *Thermus aquaticus* and *Rattus norvegicus*.
## Citation
If you use this model, please cite:
```bibtex
@article{10.1093/bioinformatics/btaf041,
author = {Holur, Pavan and Enevoldsen, K C and Rajesh, Shreyas and Mboning, Lajoyce and Georgiou, Thalia and Bouchard, Louis-S and Pellegrini, Matteo and Roychowdhury, Vwani},
title = {Embed-Search-Align: DNA Sequence Alignment using Transformer models},
journal = {Bioinformatics},
pages = {btaf041},
year = {2025},
month = {02},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btaf041},
url = {https://doi.org/10.1093/bioinformatics/btaf041},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaf041/61778456/btaf041.pdf},
}
```
For more details, check the [full paper](https://arxiv.org/abs/2309.11087v6).