---
library_name: transformers
license: mit
pipeline_tag: sentence-similarity
---

# DNA2Vec: Transformer-Based DNA Sequence Embedding

This repository provides an implementation of `dna2vec`, a transformer-based model for DNA sequence embeddings. It includes both the Hugging Face model (`hf_model`) and a locally trained model (`local_model`). The model can be used for DNA sequence alignment, classification, and other genomic applications.

## Model Overview

DNA sequence alignment is a core genomic task: mapping short DNA reads to their most probable locations within a reference genome. Traditional methods rely on genome indexing and efficient search algorithms, while recent approaches use transformer models to encode DNA sequences as vector representations. The `dna2vec` framework introduces a **Reference-Free DNA Embedding (RDE) Transformer model**, which encodes DNA sequences into a shared vector space, enabling efficient similarity search and sequence alignment.

### Key Features

- **Transformer-based architecture** trained on genomic data.
- **Reference-free embeddings** that enable efficient sequence retrieval.
- **Contrastive loss for self-supervised training**, ensuring robust sequence similarity learning.
- **Support for Hugging Face and custom-trained local models**.
- **Efficient search through a DNA vector store**, reducing genome-wide alignment to a local search.

## Model Details

### Model Architecture

The transformer encoder consists of:

- **12 attention heads**
- **6 encoder layers**
- **Embedding dimension:** 1020
- **Vocabulary size:** 10,000
- **Dropout:** 0.1
- **Cosine similarity-based sequence matching**
- **Training:** cosine annealing learning-rate schedule

## Installation

To use the model, install the required dependencies:

```bash
pip install transformers torch
```

## Usage

### Load Hugging Face Model

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn

def load_hf_model():
    hf_model = AutoModel.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)
    hf_tokenizer = AutoTokenizer.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)

    class AveragePooler(nn.Module):
        """Mean-pools token embeddings, ignoring padding positions."""
        def forward(self, last_hidden, attention_mask):
            return (last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)

    hf_model.pooler = AveragePooler()
    return hf_model, hf_tokenizer, hf_model.pooler
```

### Using the Model

Once the model is loaded, you can use it to obtain embeddings for DNA sequences. Load the model once and reuse it across calls:

```python
model, tokenizer, pooler = load_hf_model()

def get_embedding(dna_sequence):
    tokenized_input = tokenizer(dna_sequence, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**tokenized_input)
        # Average-pool the final hidden states over non-padding tokens.
        embedding = pooler(output.last_hidden_state, tokenized_input.attention_mask)
    return embedding.numpy()

# Example usage
dna_seq = "ATGCGTACGTAGCTAGCTAGC"
embedding = get_embedding(dna_seq)
print("Embedding shape:", embedding.shape)
```
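### Comparing Sequences

Since the model targets sentence-similarity style retrieval, embeddings of two reads can be compared with cosine similarity. The snippet below is a minimal sketch building on the `get_embedding` helper above; the DNA strings are illustrative examples, not sequences from the paper:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative reads: read_b is read_a with a single base substitution,
# while read_c is unrelated.
read_a = "ATGCGTACGTAGCTAGCTAGC"
read_b = "ATGCGTACGTAGCTAGCTTGC"
read_c = "GGGTTTAACCCGGGAAATTTC"

emb_a = get_embedding(read_a)[0]
emb_b = get_embedding(read_b)[0]
emb_c = get_embedding(read_c)[0]

# Near-duplicate reads should score higher than unrelated ones.
print("sim(a, b):", cosine_similarity(emb_a, emb_b))
print("sim(a, c):", cosine_similarity(emb_a, emb_c))
```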
## Training Details

### Dataset

The training data consists of DNA sequences sampled from chromosomes across several species, covering **approximately 2% of the human genome** to promote generalization across sequence contexts. Reads are generated with the **ART MiSeq** simulator, with varying insertion and deletion rates.

### Training Procedure

- **Self-supervised learning:** contrastive-loss-based training.
- **Dynamic-length sequences:** DNA fragments of length 800-2000, with reads sampled in [150, 500].
- **Noise augmentation:** 1-5% random base substitutions applied to 40% of training reads (a sketch of this augmentation appears at the end of this card).
- **Batch size:** 16, with gradient accumulation.

## Evaluation

The model was evaluated against a traditional aligner (Bowtie-2) and transformer-based baselines (DNABERT-2, HyenaDNA). The evaluation metrics include:

- **Alignment recall:** >99% for high-quality reads.
- **Cross-species transfer:** successfully aligns sequences from other species, including *Thermus aquaticus* and *Rattus norvegicus*.

## Citation

If you use this model, please cite:

```bibtex
@article{10.1093/bioinformatics/btaf041,
  author  = {Holur, Pavan and Enevoldsen, K C and Rajesh, Shreyas and Mboning, Lajoyce and Georgiou, Thalia and Bouchard, Louis-S and Pellegrini, Matteo and Roychowdhury, Vwani},
  title   = {Embed-Search-Align: DNA Sequence Alignment using Transformer models},
  journal = {Bioinformatics},
  pages   = {btaf041},
  year    = {2025},
  month   = {02},
  issn    = {1367-4811},
  doi     = {10.1093/bioinformatics/btaf041},
  url     = {https://doi.org/10.1093/bioinformatics/btaf041},
  eprint  = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaf041/61778456/btaf041.pdf},
}
```

For more details, see the [full paper](https://arxiv.org/abs/2309.11087v6).
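## Noise Augmentation Example

The noise augmentation described in the training procedure can be illustrated with a short sketch. This is not the training code from the repository; it is a minimal, hypothetical reimplementation of the described scheme (1-5% substitutions applied to 40% of reads):

```python
import random

def add_read_noise(read, sub_rate):
    """Randomly substitute bases in a read at the given per-base rate."""
    bases = "ACGT"
    noisy = []
    for base in read:
        if random.random() < sub_rate:
            # Replace with a different base chosen uniformly at random.
            noisy.append(random.choice([b for b in bases if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)

def augment_batch(reads, corrupt_fraction=0.4):
    # Corrupt ~40% of reads, each with a substitution rate drawn from [0.01, 0.05].
    return [
        add_read_noise(r, random.uniform(0.01, 0.05)) if random.random() < corrupt_fraction else r
        for r in reads
    ]

print(augment_batch(["ATGCGTACGTAGCTAGCTAGC"] * 3))
```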