DNA2Vec: Transformer-Based DNA Sequence Embedding

This repository provides an implementation of dna2vec, a transformer-based model for DNA sequence embedding. It includes both a Hugging Face model (hf_model) and a locally trained model (local_model). The model can be used for DNA sequence alignment, classification, and other genomic applications.

Model Overview

DNA sequence alignment is an essential genomic task that involves mapping short DNA reads to the most probable locations within a reference genome. Traditional methods rely on genome indexing and efficient search algorithms, while recent advances leverage transformer-based models to encode DNA sequences into vector representations.

The dna2vec framework introduces a Reference-Free DNA Embedding (RDE) Transformer model, which encodes DNA sequences into a shared vector space, allowing for efficient similarity search and sequence alignment.

Key Features:

  • Transformer-based architecture trained on genomic data.
  • Reference-free embeddings that enable efficient sequence retrieval.
  • Contrastive loss for self-supervised training, ensuring robust sequence similarity learning.
  • Support for Hugging Face and custom-trained local models.
  • Efficient search through a DNA vector store, reducing genome-wide alignment to a local search (see the sketch after this list).
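
To make the vector-store idea concrete, here is a minimal sketch of cosine-similarity search over fragment embeddings. It assumes the get_embedding helper defined under Usage below; the brute-force NumPy search is an illustrative simplification, and a real deployment would use an approximate nearest-neighbor index such as FAISS.

import numpy as np

def build_vector_store(fragments):
    # Embed each reference fragment (get_embedding is defined under Usage).
    vectors = np.vstack([get_embedding(frag) for frag in fragments])
    # L2-normalize rows so a dot product equals cosine similarity.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(read, vectors, top_k=5):
    # Embed the read and score it against every stored fragment.
    query = get_embedding(read)
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    scores = (vectors @ query.T).ravel()
    # Return indices of the top_k most similar fragments; alignment then
    # proceeds locally within those fragments only.
    return np.argsort(scores)[::-1][:top_k]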

Model Details

Model Architecture

The transformer model consists of the following components; a minimal configuration sketch follows the list:

  • 12 attention heads
  • 6 encoder layers
  • Embedding dimension: 1020
  • Vocabulary size: 10,000
  • Cosine similarity-based sequence matching
  • Dropout: 0.1
  • Training: Cosine Annealing learning rate scheduling
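
As a rough illustration of these hyperparameters, the sketch below builds a PyTorch encoder of the same shape. The use of nn.TransformerEncoder and the optimizer settings are assumptions for illustration; the actual dna2vec implementation may wire these components differently.

import torch
import torch.nn as nn

EMBED_DIM = 1020      # embedding dimension (85 per head across 12 heads)
NUM_HEADS = 12
NUM_LAYERS = 6
VOCAB_SIZE = 10_000
DROPOUT = 0.1

# Token embedding followed by a stack of self-attention encoder layers.
token_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=EMBED_DIM, nhead=NUM_HEADS, dropout=DROPOUT, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=NUM_LAYERS)

# Cosine-annealed learning rate, as noted above (T_max is a placeholder).
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)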

Installation

To use the model, install the required dependencies:

pip install transformers torch

Usage

Load Hugging Face Model

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn

def load_hf_model():
    # trust_remote_code is required because the checkpoint ships custom model code.
    hf_model = AutoModel.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)
    hf_tokenizer = AutoTokenizer.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)

    class AveragePooler(nn.Module):
        # Mean-pools token embeddings, ignoring padded positions.
        def forward(self, last_hidden, attention_mask):
            # Zero out padding, then divide by the number of real tokens.
            return (last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)

    hf_model.pooler = AveragePooler()
    return hf_model, hf_tokenizer, hf_model.pooler

Using the Model

Once the model is loaded, you can use it to obtain embeddings for DNA sequences:

model, tokenizer, pooler = load_hf_model()  # load once and reuse across calls

def get_embedding(dna_sequence):
    # Accepts a single sequence or a list of sequences.
    tokenized_input = tokenizer(dna_sequence, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**tokenized_input)
        embedding = pooler(output.last_hidden_state, tokenized_input.attention_mask)
    return embedding.numpy()

# Example usage
dna_seq = "ATGCGTACGTAGCTAGCTAGC"
embedding = get_embedding(dna_seq)
print("Embedding shape:", embedding.shape)

Training Details

Dataset

The training data consists of DNA sequences sampled from chromosomes across several species. Although the sampled data covers only about 2% of the human genome, the model generalizes to unseen sequences. Reads are generated with the ART MiSeq read simulator, varying the insertion and deletion rates; an illustrative command appears below.
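
For reference, a command along the following lines produces such reads with ART's Illumina simulator (the file names, coverage, and error rates here are placeholders, not the values used for training):

art_illumina -ss MSv1 -i reference.fa -l 150 -f 10 -ir 0.001 -dr 0.001 -o simulated_reads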

Training Procedure

  • Self-Supervised Learning: Contrastive-loss-based training (a minimal sketch follows this list).
  • Dynamic-Length Sequences: DNA fragments of 800-2,000 bp, with reads of length 150-500 sampled from them.
  • Noise Augmentation: 1-5% random base substitutions applied to 40% of training reads.
  • Batch Size: 16, with gradient accumulation.
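
To make these two ingredients concrete, here is a minimal sketch of substitution-noise augmentation and an InfoNCE-style contrastive loss over read/fragment embedding pairs. The function names, the temperature value, and the use of in-batch negatives are illustrative assumptions, not taken from the repository.

import random
import torch
import torch.nn.functional as F

def substitute_bases(read, rate):
    # Replace each base with probability `rate` by a different random base.
    bases = "ACGT"
    out = []
    for b in read:
        if random.random() < rate:
            out.append(random.choice([x for x in bases if x != b]))
        else:
            out.append(b)
    return "".join(out)

def contrastive_loss(read_emb, frag_emb, temperature=0.07):
    # read_emb, frag_emb: (batch, dim); row i of each forms a positive pair,
    # and all other rows in the batch serve as in-batch negatives (InfoNCE).
    read_emb = F.normalize(read_emb, dim=-1)
    frag_emb = F.normalize(frag_emb, dim=-1)
    logits = read_emb @ frag_emb.T / temperature  # scaled cosine similarities
    targets = torch.arange(read_emb.size(0))
    return F.cross_entropy(logits, targets)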

Evaluation

The model was evaluated against a traditional aligner (Bowtie-2) and other transformer-based baselines (DNABERT-2, HyenaDNA). Evaluation metrics include:

  • Alignment Recall: >99% for high-quality reads.
  • Cross-Species Transfer: Successfully aligns sequences from other species, including Thermus aquaticus and Rattus norvegicus.

Citation

If you use this model, please cite:

@article{10.1093/bioinformatics/btaf041,
    author = {Holur, Pavan and Enevoldsen, K C and Rajesh, Shreyas and Mboning, Lajoyce and Georgiou, Thalia and Bouchard, Louis-S and Pellegrini, Matteo and Roychowdhury, Vwani},
    title = {Embed-Search-Align: DNA Sequence Alignment using Transformer models},
    journal = {Bioinformatics},
    pages = {btaf041},
    year = {2025},
    month = {02},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btaf041},
    url = {https://doi.org/10.1093/bioinformatics/btaf041},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaf041/61778456/btaf041.pdf},
}

For more details, see the full paper.
