---
library_name: transformers
license: apache-2.0
metrics:
- perplexity
base_model:
- facebook/esm1b_t33_650M_UR50S
---

## **Pretraining on Phosphosite Sequences with MLM Objective on ESM-1b Architecture**

This repository provides an ESM-1b model pretrained on phosphosite sequences, with weights initialized **from scratch** and trained using the Masked Language Modeling (MLM) objective. The model was trained on long phosphosite-containing peptide sequences derived from PhosphoSitePlus.

### **Developed by:** Zeynep Işık (MSc, Sabanci University)

### **Training Details**

- Architecture: ESM-1b (trained from scratch)
- Pretraining Objective: Masked Language Modeling (MLM)
- Dataset: Unlabeled phosphosites from PhosphoSitePlus
- Total Samples: 352,453 (10% held out for validation)
- Sequence Length: ≤ 128 residues
- Batch Size: 64
- Optimizer: AdamW
- Learning Rate: default
- Training Duration: 1.5 days

### **Pretraining Performance**

- Perplexity at Start: 13.51
- Perplexity at End: 2.20

This substantial decrease in perplexity indicates that the model learned meaningful representations of phosphosite-related sequences.

### **Potential Use Cases**

This pretrained model can be used for downstream tasks requiring phosphosite knowledge (a usage sketch follows the list), such as:

✅ Binary classification of phosphosites

✅ Kinase-specific phosphorylation site prediction

✅ Protein-protein interaction prediction involving phosphosites
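
### **Usage (Example)**

The snippet below is a minimal sketch of how the checkpoint could be loaded with the 🤗 Transformers library for masked-residue prediction. The model id is a placeholder (replace it with this repository's Hub id), and the peptide sequence is an arbitrary example, not taken from the training data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder: replace with this repository's Hub id.
model_id = "<this-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Arbitrary peptide sequence used purely for illustration.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

# Tokenize; sequences longer than 128 residues are truncated, matching pretraining.
inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=128)

# Mask one residue and ask the model to recover it.
mask_position = 10
inputs["input_ids"][0, mask_position] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

# Highest-scoring token at the masked position.
predicted_id = logits[0, mask_position].argmax(dim=-1).item()
print("Predicted residue:", tokenizer.convert_ids_to_tokens(predicted_id))
```

For the downstream tasks listed above, the same checkpoint could serve as a starting point for fine-tuning, for example with `AutoModelForSequenceClassification` for binary phosphosite classification.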