Pretraining on Phosphosite Sequences with MLM Objective on ESM-1b Architecture
This repository provides an ESM-1b model pretrained on phosphosite sequences, with weights initialized from scratch and trained using the Masked Language Modeling (MLM) objective. The model was trained on long phosphosite-containing peptide sequences derived from PhosphoSitePlus.
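A minimal usage sketch for masked-residue prediction with the transformers library is shown below. It assumes the checkpoint is published in standard transformers format under the repo id listed on this card (isikz/esm1b_mlm_pt_phosphosite); the example peptide sequence is made up for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: the repo contains both tokenizer and model files in transformers format.
tokenizer = AutoTokenizer.from_pretrained("isikz/esm1b_mlm_pt_phosphosite")
model = AutoModelForMaskedLM.from_pretrained("isikz/esm1b_mlm_pt_phosphosite")
model.eval()

# Mask one residue in a hypothetical phosphosite-centered peptide.
sequence = "MKTAYIAKQR" + tokenizer.mask_token + "DLSHLQNRRW"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Most likely amino acid at the masked position.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```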
Developed by:
Zeynep Işık (MSc, Sabanci University)
Training Details
- Architecture: ESM-1b (trained from scratch)
- Pretraining Objective: Masked Language Modeling (MLM)
- Dataset: Unlabeled phosphosites from PhosphoSitePlus
- Total Samples: 352,453 (10% separated for validation)
- Sequence Length: ≤ 128 residues
- Batch Size: 64
- Optimizer: AdamW
- Learning Rate: default
- Training Duration: 1.5 days

A sketch of an MLM pretraining setup matching these settings follows.
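The sketch below uses the transformers Trainer and reflects the settings listed above (from-scratch ESM-1b weights, max length 128, batch size 64, AdamW with the default learning rate, 10% validation split). The data file name, tokenization helper, and 15% masking probability are assumptions; the actual training script may differ.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EsmConfig,
    EsmForMaskedLM,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")

# Weights initialized from scratch using the ESM-1b configuration (no pretrained weights loaded).
config = EsmConfig.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = EsmForMaskedLM(config)

# Placeholder file: one phosphosite peptide sequence per line.
dataset = load_dataset("text", data_files={"train": "phosphosites.txt"})
dataset = dataset["train"].train_test_split(test_size=0.1)  # 10% held out for validation

def tokenize(batch):
    # Sequence length capped at 128 residues, as stated above.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM masking; 15% is the common default and an assumption here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="esm1b_mlm_pt_phosphosite",
    per_device_train_batch_size=64,
    # AdamW with the default learning rate is used, matching the card.
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
trainer.train()
```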
Pretraining Performance
- Perplexity at Start: 13.51
- Perplexity at End: 2.20

The significant decrease in perplexity indicates that the model effectively learned meaningful representations of phosphosite-related sequences.
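For reference, MLM perplexity is the exponential of the mean masked-token cross-entropy loss. A short sketch of computing it on the validation split, reusing the hypothetical trainer from the setup sketch above:

```python
import math

# trainer.evaluate() returns the mean masked-token cross-entropy as "eval_loss".
eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Validation perplexity: {perplexity:.2f}")
```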
Potential Use Cases
This pretrained model can be used for downstream tasks requiring phosphosite knowledge, such as:
✅ Binary classification of phosphosites (a fine-tuning sketch follows this list)
✅ Kinase-specific phosphorylation site prediction
✅ Protein-protein interaction prediction involving phosphosites
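An illustrative sketch of adapting the pretrained encoder for binary phosphosite classification by attaching a sequence-classification head. The example peptide and label setup are placeholders; the classification head is newly initialized and would still need fine-tuning on labeled data.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("isikz/esm1b_mlm_pt_phosphosite")
model = AutoModelForSequenceClassification.from_pretrained(
    "isikz/esm1b_mlm_pt_phosphosite",
    num_labels=2,  # e.g. phosphosite vs. non-phosphosite
)

# Forward pass on a hypothetical peptide; logits come from an untrained head.
inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQ", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.softmax(dim=-1))
```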
Model tree for isikz/esm1b_mlm_pt_phosphosite
- Base model: facebook/esm1b_t33_650M_UR50S