InstaNovoPlus: Diffusion-Powered De novo Peptide Sequencing Model

Model Description

InstaNovoPlus is a diffusion-based model for de novo peptide sequencing from mass spectrometry data. It uses multinomial diffusion to identify peptides accurately without a database search, and is designed for large-scale proteomics experiments.
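
As a rough illustration of what multinomial diffusion refers to, the sketch below applies the categorical forward-noising step from the general multinomial diffusion formulation, where each token's distribution is mixed with a uniform distribution over the vocabulary. This is a generic, standard-library-only sketch and not InstaNovoPlus's implementation; the function names, the five-token vocabulary, and the noise level `beta` are illustrative assumptions.

```python
import random


def noise_step(probs, beta):
    """One forward step: mix a categorical distribution with the uniform one.

    Returns (1 - beta) * probs + beta / K for a vocabulary of K tokens.
    """
    k = len(probs)
    return [(1.0 - beta) * p + beta / k for p in probs]


def sample_token(probs, rng):
    """Draw a token index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical 5-token vocabulary; the clean token is index 2 (one-hot).
clean = [0.0, 0.0, 1.0, 0.0, 0.0]

# With beta = 0.2 the clean token keeps about 0.84 of the probability mass
# and each of the other four tokens receives about 0.04.
noised = noise_step(clean, beta=0.2)

rng = random.Random(0)
sampled = sample_token(noised, rng)  # an index in [0, 5), most often 2
```

A diffusion model is trained to undo corruption of this kind step by step; for peptide sequencing, that reverse process would be conditioned on the input spectrum.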

Usage

import torch
import numpy as np
import pandas as pd
from instanovo.diffusion.multinomial_diffusion import InstaNovoPlus
from instanovo.utils import SpectrumDataFrame
from instanovo.transformer.dataset import SpectrumDataset, collate_batch
from torch.utils.data import DataLoader
from instanovo.inference import ScoredSequence
from instanovo.inference.diffusion import DiffusionDecoder
from instanovo.utils.metrics import Metrics
from tqdm.notebook import tqdm

# Load the model from the Hugging Face Hub
model, config = InstaNovoPlus.from_pretrained("InstaDeepAI/instanovoplus-v1.1.0")

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Update the residue set with custom modifications
model.residue_set.update_remapping(
    {
        "M(ox)": "M[UNIMOD:35]",
        "M(+15.99)": "M[UNIMOD:35]",
        "S(p)": "S[UNIMOD:21]",  # Phosphorylation
        "T(p)": "T[UNIMOD:21]",
        "Y(p)": "Y[UNIMOD:21]",
        "S(+79.97)": "S[UNIMOD:21]",
        "T(+79.97)": "T[UNIMOD:21]",
        "Y(+79.97)": "Y[UNIMOD:21]",
        "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation
        "N(+0.98)": "N[UNIMOD:7]",
        "Q(+.98)": "Q[UNIMOD:7]",
        "N(+.98)": "N[UNIMOD:7]",
        "C(+57.02)": "C[UNIMOD:4]",  # Carboxyamidomethylation
        "(+42.01)": "[UNIMOD:1]",  # Acetylation
        "(+43.01)": "[UNIMOD:5]",  # Carbamylation
        "(-17.03)": "[UNIMOD:385]",  # Ammonia loss
    }
)

# Load the test data
sdf = SpectrumDataFrame.from_huggingface(
    "InstaDeepAI/ms_ninespecies_benchmark",
    is_annotated=True,
    shuffle=False,
    split="test[:10%]",  # Let's only use a subset of the test data for faster inference
)

# Create the dataset
ds = SpectrumDataset(
    sdf,
    model.residue_set,
    config.get("n_peaks", 200),
    return_str=False,
    annotated=True,
    peptide_pad_length=model.config.get("max_length", 30),
    reverse_peptide=False,  # we do not reverse peptide for diffusion
    add_eos=False,
    tokenize_peptide=True,
)

# Create the data loader
dl = DataLoader(
    ds,
    batch_size=64,
    num_workers=0,  # sdf requirement, handled internally
    shuffle=False,  # sdf requirement, handled internally
    collate_fn=collate_batch,
)

# Create the decoder
diffusion_decoder = DiffusionDecoder(model=model)

predictions = []
log_probs = []

# Collect the ground-truth peptides alongside the predictions
targets = []

# Iterate over the data loader
for batch in tqdm(dl, total=len(dl)):
    spectra, precursors, spectra_padding_mask, peptides, _ = batch
    spectra = spectra.to(device)
    precursors = precursors.to(device)
    spectra_padding_mask = spectra_padding_mask.to(device)
    peptides = peptides.to(device)

    # Perform inference
    with torch.no_grad():
        batch_predictions, batch_log_probs = diffusion_decoder.decode(
            spectra=spectra,
            spectra_padding_mask=spectra_padding_mask,
            precursors=precursors,
            initial_sequence=peptides,
        )
    predictions.extend(batch_predictions)
    log_probs.extend(batch_log_probs)

    # Decode the target token ids back to residues for scoring.
    # Assumption: ResidueSet.decode drops padding and other special tokens.
    targets.extend(model.residue_set.decode(p, reverse=False) for p in peptides)

# Initialize metrics
metrics = Metrics(model.residue_set, config["isotope_error_range"])

# Compute precision and recall against the collected targets
aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
    targets, predictions
)

# Compute amino acid error rate and AUC
aa_error_rate = metrics.compute_aa_er(targets, predictions)
auc = metrics.calc_auc(targets, predictions, np.exp(pd.Series(log_probs)))

print(f"amino acid error rate:    {aa_error_rate:.5f}")
print(f"amino acid precision:     {aa_precision:.5f}")
print(f"amino acid recall:        {aa_recall:.5f}")
print(f"peptide precision:        {peptide_precision:.5f}")
print(f"peptide recall:           {peptide_recall:.5f}")
print(f"area under the PR curve:  {auc:.5f}")
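
The usage example exponentiates the decoder's log-probabilities before computing the AUC, which suggests they can also serve as per-prediction confidence scores. The sketch below shows one way to filter predictions on such a score, using made-up values; `filter_by_confidence` and the 0.5 threshold are hypothetical and not part of the InstaNovo API.

```python
import math


def filter_by_confidence(predictions, log_probs, threshold=0.5):
    """Keep (peptide, confidence) pairs whose exp(log_prob) meets the threshold."""
    kept = []
    for tokens, log_prob in zip(predictions, log_probs):
        confidence = math.exp(log_prob)
        if confidence >= threshold:
            kept.append(("".join(tokens), confidence))
    return kept


# Made-up example values, not real model output.
example_predictions = [["P", "E", "P", "T", "I", "D", "E"], ["M[UNIMOD:35]", "K"]]
example_log_probs = [-0.1, -2.0]

# Only the first prediction survives: exp(-0.1) is about 0.90, exp(-2.0) about 0.14.
confident = filter_by_confidence(example_predictions, example_log_probs)
```

A suitable threshold depends on the dataset and the acceptable error rate, so it is best chosen from a precision-recall curve rather than fixed in advance.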

For more explanation, see the Getting Started notebook in the repository.
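
The remapping passed to `update_remapping` in the example translates legacy modification notations into UNIMOD identifiers. To illustrate the idea only, the sketch below applies a small subset of that mapping with plain string replacement; `remap_modifications` is a hypothetical helper, and the library's own handling may differ.

```python
# Subset of the remapping shown in the usage example.
REMAPPING = {
    "M(ox)": "M[UNIMOD:35]",
    "S(p)": "S[UNIMOD:21]",
    "C(+57.02)": "C[UNIMOD:4]",
    "(+42.01)": "[UNIMOD:1]",
}


def remap_modifications(peptide, remapping):
    """Rewrite legacy modification tags in a peptide string to UNIMOD tags."""
    # Replace longer keys first so a short key never clips a longer one.
    for legacy in sorted(remapping, key=len, reverse=True):
        peptide = peptide.replace(legacy, remapping[legacy])
    return peptide


remapped = remap_modifications("(+42.01)AC(+57.02)M(ox)S(p)K", REMAPPING)
# remapped == "[UNIMOD:1]AC[UNIMOD:4]M[UNIMOD:35]S[UNIMOD:21]K"
```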

Citation

If you use InstaNovoPlus in your research, please cite:

@article{eloff_kalogeropoulos_2025_instanovo,
        title        = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
                        proteomics experiments},
        author       = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
                        Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
                        Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
                        and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
                        Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
                        Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
        year         = {2025},
        month        = {Mar},
        day          = {31},
        journal      = {Nature Machine Intelligence},
        doi          = {10.1038/s42256-025-01019-5},
        issn         = {2522-5839},
        url          = {https://doi.org/10.1038/s42256-025-01019-5}
}

Resources

Code Repository: https://github.com/instadeepai/InstaNovo
Documentation: https://instadeepai.github.io/InstaNovo/
Publication: https://www.nature.com/articles/s42256-025-01019-5

License

Code: Licensed under Apache License 2.0
Model Checkpoints: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)

Installation

pip install instanovo

For GPU support, install with CUDA dependencies:

pip install "instanovo[cu126]"

Requirements

Python >= 3.10, < 3.13
PyTorch >= 1.13.0
CUDA (optional, for GPU acceleration)
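
The Python version bounds listed above can be checked before installing. A minimal sketch using only the standard library; `python_supported` is a hypothetical helper, and the bounds are copied from the requirements in this section.

```python
import sys


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter satisfies Python >= 3.10, < 3.13."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


if not python_supported():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the supported range (>= 3.10, < 3.13)"
    )
```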

Support

For questions, issues, or contributions, please visit the GitHub repository or check the documentation.
