Pretraining on Phosphosite Sequences with MLM Objective on ESM-1b Architecture

This repository provides an ESM-1b model pretrained on phosphosite sequences: the weights are initialized from scratch and trained with the Masked Language Modeling (MLM) objective. The training data consists of long phosphosite-containing peptide sequences derived from PhosphoSitePlus.
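As a quick sanity check, the checkpoint can be loaded for masked-token prediction. The sketch below assumes the repository (isikz/esm1b_mlm_pt_phosphosite) ships a tokenizer compatible with the standard transformers ESM classes; the peptide string is an arbitrary placeholder, not a sequence from the dataset.

```python
# Minimal masked-token inference sketch (assumes the standard transformers ESM interface).
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

repo_id = "isikz/esm1b_mlm_pt_phosphosite"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = EsmForMaskedLM.from_pretrained(repo_id)
model.eval()

# Arbitrary example peptide with one residue masked out.
sequence = "MSDE" + tokenizer.mask_token + "PLSRTPSDSLV"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token = tokenizer.decode(logits[0, mask_pos].argmax(dim=-1))
print(f"Predicted residue at masked position: {predicted_token}")
```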

Developed by:

Zeynep Işık (MSc, Sabanci University)

Training Details

Architecture: ESM-1b (trained from scratch)
Pretraining Objective: Masked Language Modeling (MLM)
Dataset: Unlabeled phosphosites from PhosphoSitePlus
Total Samples: 352,453 (10% separated for validation)
Sequence Length: ≤ 128 residues
Batch Size: 64
Optimizer: AdamW
Learning Rate: default
Training Duration: 1.5 days
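The listed settings can be mirrored with a standard transformers MLM pretraining loop. The sketch below is illustrative rather than the authors' actual script: the data files, the 15% masking probability, and the epoch count are assumptions, and the model is built from the public ESM-1b config with random weights to match "trained from scratch".

```python
# Illustrative MLM pretraining setup mirroring the listed hyperparameters (not the original script).
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EsmConfig,
    EsmForMaskedLM,
    Trainer,
    TrainingArguments,
)

base = "facebook/esm1b_t33_650M_UR50S"
tokenizer = AutoTokenizer.from_pretrained(base)
model = EsmForMaskedLM(EsmConfig.from_pretrained(base))  # random init: trained from scratch

# Placeholder data files: one phosphosite-containing peptide per line.
dataset = load_dataset("text", data_files={"train": "phosphosites_train.txt",
                                           "validation": "phosphosites_val.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="esm1b_mlm_pt_phosphosite",
    per_device_train_batch_size=64,   # batch size 64
    optim="adamw_torch",              # AdamW with the default learning rate
    num_train_epochs=3,               # assumed; the card reports ~1.5 days of training
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()
```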

Pretraining Performance

Perplexity at Start: 13.51
Perplexity at End: 2.20

The significant decrease in perplexity indicates that the model has effectively learned meaningful representations of phosphosite-related sequences.
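Perplexity here is the exponential of the mean masked-token cross-entropy loss, so the reported drop corresponds to the validation loss falling from roughly 2.60 to roughly 0.79 nats per masked token; these loss values are back-computed from the reported perplexities, not taken from the training logs.

```python
import math

# Perplexity = exp(mean masked-token cross-entropy loss).
# Loss values below are back-computed from the reported perplexities.
print(math.exp(2.603))  # ≈ 13.51  (start of pretraining)
print(math.exp(0.788))  # ≈ 2.20   (end of pretraining)
```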

Potential Use Cases

This pretrained model can be used for downstream tasks requiring phosphosite knowledge, such as:

✅ Binary classification of phosphosites
✅ Kinase-specific phosphorylation site prediction
✅ Protein-protein interaction prediction involving phosphosites
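For the first use case, the checkpoint can serve as the backbone of a sequence classifier. The sketch below is only a starting point: it assumes the transformers EsmForSequenceClassification head, the peptide is a placeholder, and the newly added classification head is randomly initialized, so it must be fine-tuned on labelled phosphosite data before its predictions are meaningful.

```python
# Hedged fine-tuning starting point for binary phosphosite classification.
from transformers import AutoTokenizer, EsmForSequenceClassification

repo_id = "isikz/esm1b_mlm_pt_phosphosite"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = EsmForSequenceClassification.from_pretrained(repo_id, num_labels=2)

# Placeholder peptide; the classification head is untrained until fine-tuned on labelled data.
inputs = tokenizer("MSDESPLSRTPSDSLV", truncation=True, max_length=128, return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2); interpretable only after fine-tuning
```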
