AbLangRBD1: Contrastive-Learned Antibody Embeddings for SARS-CoV-2 RBD Binding

This repository contains the model, code, and tokenizers for AbLangRBD1.

Model Description

AbLangRBD1 is a fine-tuned antibody language model for generating embeddings of antibodies targeting the SARS-CoV-2 Receptor Binding Domain (RBD).

The model was developed using contrastive learning on paired heavy and light chain sequences, as described in our paper:

Clinton M. Holt, Alexis K. Janke, Parastoo Amlashi, Parker J. Jamieson, Toma M. Marinov, Ivelin S. Georgiev. Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery. bioRxiv, 2025. https://doi.org/10.1101/2025.02.25.640114

Model Architecture

Heavy Chain Seq -> [AbLang Heavy] -> 768-dim -> |
                                                | -> [Concatenate] -> [Mixer Network] -> 1536-dim Paired Embedding
Light Chain Seq -> [AbLang Light] -> 768-dim -> |

The AbLangRBD1 model uses the AbLangPaired architecture, a custom class that processes the heavy and light chains of an antibody independently using the pre-trained AbLang models before fusing their embeddings. The resulting embeddings from the two AbLang models are concatenated and passed through a custom Mixer network (six fully connected feed-forward layers) to produce a final, unified 1536-dimensional embedding for the paired antibody.
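The exact layer widths and activations of the Mixer live in ablangpaired_model.py; the snippet below is only a minimal sketch of the fusion step, assuming a plain stack of six linear layers (hypothetical sizes, not the released weights):

import torch
import torch.nn as nn

class MixerSketch(nn.Module):
    """Illustrative fusion head: concatenate the 768-dim heavy and light chain
    embeddings and pass them through six fully connected layers."""
    def __init__(self, chain_dim=768, out_dim=1536, n_layers=6):
        super().__init__()
        dim = 2 * chain_dim  # 1536 after concatenation
        layers = []
        for _ in range(n_layers - 1):
            layers += [nn.Linear(dim, dim), nn.ReLU()]
        layers.append(nn.Linear(dim, out_dim))  # final 1536-dim paired embedding
        self.net = nn.Sequential(*layers)

    def forward(self, heavy_emb, light_emb):
        return self.net(torch.cat([heavy_emb, light_emb], dim=-1))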

The pretrained heavy model is AbLang_heavy and the pretrained light model is AbLang_light. In brief, these use the RoBERTa architecture pretrained with the masked language modeling objective. Each model consists of 12 transformer blocks with 12 attention heads, a hidden size of 768, and an inner (feed-forward) hidden size of 3072, and uses a learned positional embedding specific to antibodies with a maximum length of 160. The 768-dimensional embedding from each model is generated by mean pooling over all residue-level embeddings.
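Mean pooling over residue-level embeddings can be written as in the sketch below; this is a generic masked mean pool (the model's own pooling is implemented in ablangpaired_model.py), assuming padding tokens are excluded via the attention mask:

import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average residue-level embeddings, ignoring padding positions.

    hidden_states:  (batch, seq_len, 768) token embeddings from AbLang
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
    return summed / counts                        # (batch, 768)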

During training, these pretrained models were frozen and a LoRA adapter was added.
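As a rough illustration only, a LoRA adapter can be attached to a frozen Hugging Face backbone with the peft library; the model path, rank, alpha, dropout, and target modules below are placeholders, not the settings used to train AbLangRBD1:

# Hypothetical sketch using the Hugging Face peft library; all values are placeholders.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("path/to/ablang_heavy")  # placeholder model path
for p in base.parameters():
    p.requires_grad = False  # freeze the pretrained weights

lora_cfg = LoraConfig(
    r=8,                                # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # RoBERTa-style attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()      # only the LoRA weights remain trainable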

Intended uses & limitations

The model is intended to be used to generate epitope-information-rich embeddings of SARS-CoV-2 RBD antibodies, but a prediction head could be added to the model to make predictions such as neutralization capacity.
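A minimal sketch of such a prediction head, operating on precomputed 1536-dimensional paired embeddings (the class name and layer sizes are illustrative and not part of this repository):

import torch
import torch.nn as nn

class NeutralizationHead(nn.Module):
    """Illustrative binary prediction head on top of frozen AbLangRBD1 embeddings."""
    def __init__(self, embed_dim=1536):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, paired_embedding):
        # returns a logit; apply a sigmoid for a neutralization probability
        return self.classifier(paired_embedding)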

  1. Epitope Classification: Antibodies with unknown epitopes can be embedded and compared against a reference database of antibodies with known epitopes. The reference antibody with the highest cosine similarity provides the best estimate of the query antibody's epitope (see the sketch after this list).

This can be extended to comparing two immune repertoires following immunological challenge. For example, suppose you were testing a wild-type RBD vaccine against a glycan-masked RBD vaccine. Following vaccination, RBD-specific B cell sorting, and single-cell sequencing, one would embed the two BCR repertoires, then perform a t-SNE dimensionality reduction of these two repertoires together with the antibodies used to train this model. The three plots (the reference antibodies and the two vaccine groups) can then be assessed side by side, with epitopes assigned according to the region of space occupied by the training-set antibodies. This allows rapid visualization of how the proportions of B cells targeting each epitope have shifted. Limitation: mouse BCRs are unlikely to perform well here, and BCRs that do not bind the index strain are likely to have reduced classification accuracy.

  2. Antibody Search: A reference antibody sequence can be embedded along with a large search database. Antibodies in the search database with high cosine similarity to the reference can be assumed to target similar epitopes.

  3. Unsupervised Clustering: To conserve resources in a discovery campaign, initial antibodies can be embedded and clustered. Representative candidates can then be chosen from each cluster for downstream characterization.
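A minimal sketch of use case 1, assuming query_embeddings and reference_embeddings (with matching reference_epitopes labels) have already been generated as shown in the How to Use section below:

import torch
import torch.nn.functional as F

def assign_epitopes(query_embeddings, reference_embeddings, reference_epitopes):
    """Label each query antibody with the epitope of its most similar reference
    antibody under cosine similarity.

    query_embeddings:     (n_query, 1536) tensor from AbLangRBD1
    reference_embeddings: (n_ref, 1536) tensor from AbLangRBD1
    reference_epitopes:   list of n_ref epitope labels
    """
    q = F.normalize(query_embeddings, dim=-1)
    r = F.normalize(reference_embeddings, dim=-1)
    sims = q @ r.T                          # (n_query, n_ref) cosine similarities
    best_sim, best_idx = sims.max(dim=-1)   # nearest reference per query
    return [reference_epitopes[i] for i in best_idx.tolist()], best_sim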

Training data

For AbLang-RBD, we utilized published deep mutational scanning data comprising 3,195 antibodies from two papers (Cao et al. 2022; Cao et al. 2023), of which only the 3,093 that demonstrated binding to the SARS-CoV-2 index strain were kept. These antibodies were clustered based on heavy chain V-gene usage and >70% CDRH3 amino acid identity, with clusters distributed across training (80%), validation (10%), and test (10%) sets such that no antibodies from the same cluster appeared in both the training and test sets.
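The clustering itself is described in the paper; the sketch below only illustrates the cluster-aware 80/10/10 split, assuming a hypothetical cluster_id column has already been assigned from V-gene usage and CDRH3 identity:

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def cluster_split(df: pd.DataFrame, group_col: str = "cluster_id", seed: int = 0):
    """80/10/10 split that keeps every cluster in exactly one partition.
    Proportions are approximate because whole clusters are assigned."""
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(outer.split(df, groups=df[group_col]))
    rest = df.iloc[rest_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_idx, test_idx = next(inner.split(rest, groups=rest[group_col]))
    return df.iloc[train_idx], rest.iloc[val_idx], rest.iloc[test_idx]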

Training Procedure

The AbLang-RBD model was trained using a supervised contrastive learning approach to differentiate antibody embeddings based on their epitope label. Specifically, we employed the Supervised Contrastive Loss function, as introduced by Khosla et al. 2020.
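A condensed sketch of that loss is given below; the temperature and implementation details here are illustrative and may differ from the actual training code:

import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss (Khosla et al. 2020): pull together embeddings
    that share an epitope label and push apart those that do not."""
    z = F.normalize(embeddings, dim=-1)                        # (N, D)
    sims = z @ z.T / temperature                               # (N, N) similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sims = sims.masked_fill(self_mask, float("-inf"))          # never contrast with self
    log_prob = sims - torch.logsumexp(sims, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)            # diagonal is never a positive
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)              # anchors without positives contribute 0
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    return loss.mean()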

How to Use

To use this model, follow the steps below.

1. Setup

First, clone this repository and install the required libraries.

# Clone the repository to get the model script, weights, and tokenizers
git clone https://huggingface.co/clint-holt/AbLangRBD1
cd AbLangRBD1

# Install dependencies
pip install torch pandas "transformers>=4.30.0" safetensors

2. Generate Embeddings

Then run the following Python code to embed paired heavy/light chain sequences:


import torch
import pandas as pd
from transformers import AutoTokenizer

# Import the custom model class and config from the cloned repository
from ablangpaired_model import AbLangPaired, AbLangPairedConfig

# 1. Load Model and Tokenizers
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_dir = "." # Assumes you are running this script from the cloned directory

# Configure the model to load the local weights
# The AbLangPairedConfig specifies the base AbLang models and the local checkpoint file
model_config = AbLangPairedConfig(checkpoint_filename=f"{model_dir}/model.safetensors")
model = AbLangPaired(model_config, device).to(device)
model.eval()

# Tokenizers are stored in subdirectories
heavy_tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/heavy_tokenizer")
light_tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/light_tokenizer")

# 2. Prepare Antibody Sequences
data = {
    'HC_AA': ["EVQLVESGGGFVQPGRSLRLSCAASGFIMDDYAMHWVRQAPGKGLEWVSGISWNSGTRGYADSVKGRFTVSRDNAKNSFYLQMNSLRAADTAVYYCAKDHGPWIAANGHYFDYWGQGTLVTVSS"],
    'LC_AA': ["QSVLTQPPSASGTPGQRVTISCSGSKSNIGSNPVNWYQQLPGTAPKLLIYSNNERPSGVPARFSGSKSGTSASLAISGLQSEDEADYYCVTWDDSLNGWVFGGGTKLTVL"]
}
df = pd.DataFrame(data)

# Pre-process sequences by adding spaces between amino acids
df["PREPARED_HC_SEQ"] = df["HC_AA"].apply(lambda x: " ".join(list(x)))
df["PREPARED_LC_SEQ"] = df["LC_AA"].apply(lambda x: " ".join(list(x)))

# 3. Tokenize and Embed
h_tokens = heavy_tokenizer(df["PREPARED_HC_SEQ"].tolist(), padding='longest', return_tensors="pt")
l_tokens = light_tokenizer(df["PREPARED_LC_SEQ"].tolist(), padding='longest', return_tensors="pt")

with torch.no_grad():
    embeddings = model(
        h_input_ids=h_tokens['input_ids'].to(device),
        h_attention_mask=h_tokens['attention_mask'].to(device),
        l_input_ids=l_tokens['input_ids'].to(device),
        l_attention_mask=l_tokens['attention_mask'].to(device)
    )

print("Embedding generation complete! โœ…")
print("Shape of embeddings tensor:", embeddings.shape)
# Expected output shape: (1, 1536)

Citation

If you use this model or code in your research, please cite our paper:


@article {Holt2025.02.25.640114,
    author = {Holt, Clinton M. and Janke, Alexis K. and Amlashi, Parastoo and Jamieson, Parker J. and Marinov, Toma M. and Georgiev, Ivelin S.},
    title = {Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery},
    elocation-id = {2025.02.25.640114},
    year = {2025},
    doi = {10.1101/2025.02.25.640114},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/04/01/2025.02.25.640114},
    eprint = {https://www.biorxiv.org/content/early/2025/04/01/2025.02.25.640114.full.pdf},
    journal = {bioRxiv}
}