isikz committed
Commit 71d5cfe · verified · Parent: 5b6f5e1

Update README.md

Files changed (1): README.md (+5 -4)
README.md CHANGED
@@ -9,7 +9,7 @@ base_model:
 
 ## **Fine-Tuning ESM-1b with Multiple Sequence Alignment (MSA) for Phosphosites**
 
-This repository provides a fine-tuned version of ESM-1b, incorporating genomic information by leveraging long phosphosite sequences from [DARKIN dataset](https://openreview.net/pdf?id=a4x5tbYRYV) and Multiple Sequence Alignment (MSA) of those phosphosites. The goal is to enhance the model's understanding of phosphorylation by integrating sequence conservation patterns.
+This repository provides a fine-tuned version of ESM-1b with a Masked Language Modeling (MLM) objective, incorporating genomic information by leveraging long phosphosite sequences from the [DARKIN dataset](https://openreview.net/pdf?id=a4x5tbYRYV) and Multiple Sequence Alignments (MSAs) of those phosphosites. The goal is to enhance the model's understanding of phosphorylation by integrating sequence conservation patterns.
 
 ### Developed by:
 
@@ -26,19 +26,20 @@ To construct a robust dataset, we extracted 256 MSA sequences per phosphosite fr
 - 10% of the data was reserved for validation.
 - The remaining 90% was used for fine-tuning with the Masked Language Modeling (MLM) objective.
 3. Data Processing & Preprocessing
-- Special attention was given to retaining phosphorylation residues within sequences.
+- Special attention was given to conserving phosphorylation residues within sequences.
 - To optimize memory efficiency, sequence lengths were truncated to 128 amino acids.
 
 ### Evaluation
 
 Perplexity: 2.69 (decreased from 7.05)
 
-from transformers import AutoTokenizer, AutoModelForMaskedLM
-import torch
 
 ### Usage
 
 ```
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+import torch
+
 # Load the model and tokenizer
 model_name = "isikz/phosphosite_msa_finetuned_esm1b"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
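
The updated summary line says ESM-1b was fine-tuned with the MLM objective on phosphosite MSA sequences, using a 90/10 train/validation split. A minimal sketch of such a run with the Hugging Face Trainer, assuming the public `facebook/esm1b_t33_650M_UR50S` checkpoint as the starting point and a small in-memory list standing in for the MSA sequences (both are assumptions; the repository's actual training script is not part of this diff):

```
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_checkpoint = "facebook/esm1b_t33_650M_UR50S"  # assumed ESM-1b starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Illustrative stand-in for the phosphosite MSA sequences (the README extracts 256 per site).
msa_sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "GSHMSLFDFFKNKGSAAATPANS",
    "MSDNGPQNQRNAPRITFGGPSDSTGSNQ",
    "MTEYKLVVVGAGGVGKSALTIQLIQ",
]
dataset = Dataset.from_dict({"text": msa_sequences})

def tokenize(batch):
    # 128-amino-acid cap taken from the README's preprocessing step
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
split = tokenized.train_test_split(test_size=0.1)  # 90% fine-tuning / 10% validation

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="esm1b_phosphosite_mlm",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
```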
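
The preprocessing bullet changed in this hunk says the phosphorylation residue is kept when sequences are truncated to 128 amino acids, but the diff does not show how. A minimal sketch of one way to do this, assuming the phosphosite index within each sequence is known and the 128-residue window is simply centered on it (the centering strategy and the helper name are illustrative assumptions, not code from the repository):

```
def crop_around_phosphosite(sequence: str, site_idx: int, max_len: int = 128) -> tuple[str, int]:
    """Crop to at most max_len residues while keeping the phosphosite in view (assumed strategy)."""
    if len(sequence) <= max_len:
        return sequence, site_idx
    start = site_idx - max_len // 2                      # center the window on the phosphosite
    start = max(0, min(start, len(sequence) - max_len))  # clamp the window to the sequence bounds
    return sequence[start:start + max_len], site_idx - start

# Example: a 501-residue sequence with a phosphoserine at index 300
seq = "M" + "A" * 299 + "S" + "G" * 200
cropped, new_idx = crop_around_phosphosite(seq, 300)
assert len(cropped) == 128 and cropped[new_idx] == "S"
```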
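
The evaluation section reports perplexity dropping from 7.05 to 2.69 but does not show how it is computed. A minimal sketch of one way to obtain an MLM perplexity for this model, assuming random 15% masking over a held-out list of sequences (the masking rate, the example sequences, and the loop itself are illustrative assumptions, not the repository's evaluation code):

```
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

model_name = "isikz/phosphosite_msa_finetuned_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Illustrative held-out sequences; in practice this would be the 10% validation split.
val_sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GSHMSLFDFFKNKGSAAATPANS"]
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

losses = []
with torch.no_grad():
    for seq in val_sequences:
        enc = tokenizer(seq, truncation=True, max_length=128, return_tensors="pt")
        batch = collator([{"input_ids": enc["input_ids"][0]}])  # randomly masks ~15% of tokens
        out = model(input_ids=batch["input_ids"], labels=batch["labels"])
        losses.append(out.loss.item())

print("perplexity:", math.exp(sum(losses) / len(losses)))  # exp of the mean masked-LM loss
```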
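
The usage snippet in the diff is cut off right after the tokenizer is loaded. A complete, self-contained version under the same model name, extended with an illustrative masked-residue prediction (the example sequence and the choice of which position to mask are assumptions made for demonstration):

```
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the model and tokenizer
model_name = "isikz/phosphosite_msa_finetuned_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Illustrative sequence: mask one residue and ask the model to fill it in.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")
mask_pos = inputs["input_ids"].shape[1] // 2               # arbitrary position chosen for the demo
inputs["input_ids"][0, mask_pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits[0, mask_pos].argmax(-1).item()
print("predicted residue:", tokenizer.decode([predicted_id]))
```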