svincoff committed
Commit 3efa812 · 1 Parent(s): e048d40

mutation prediction discovery and recovery

Files changed (40)
  1. fuson_plm/benchmarking/caid/README.md +2 -2
  2. fuson_plm/benchmarking/idr_prediction/README.md +2 -2
  3. fuson_plm/benchmarking/mutation_prediction/README.md +114 -0
  4. fuson_plm/benchmarking/mutation_prediction/discovery/clean.py +71 -0
  5. fuson_plm/benchmarking/mutation_prediction/discovery/color_discovered_mutations.ipynb +418 -0
  6. fuson_plm/benchmarking/mutation_prediction/discovery/config.py +16 -0
  7. fuson_plm/benchmarking/mutation_prediction/discovery/discover.py +346 -0
  8. fuson_plm/benchmarking/mutation_prediction/discovery/make_color_bar.py +25 -0
  9. fuson_plm/benchmarking/mutation_prediction/discovery/plot.py +167 -0
  10. fuson_plm/benchmarking/mutation_prediction/discovery/processed_data/521_logit_bfactor.cif +0 -0
  11. pytorch_model.bin → fuson_plm/benchmarking/mutation_prediction/discovery/processed_data/domain_conservation_fusions_inputfile.csv +2 -2
  12. fuson_plm/benchmarking/mutation_prediction/discovery/processed_data/test_seqs_tftf_kk.csv +3 -0
  13. fuson_plm/benchmarking/mutation_prediction/discovery/raw_data/salokas_2020_tableS3.csv +3 -0
  14. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/ETV6::NTRK3/conservation_heatmap.png +0 -0
  15. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/ETV6::NTRK3/full_results_with_logits.csv +3 -0
  16. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/ETV6::NTRK3/predicted_tokens.csv +3 -0
  17. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/ETV6::NTRK3/raw_mutation_results.pkl +3 -0
  18. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/EWSR1::FLI1/conservation_heatmap.png +0 -0
  19. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/EWSR1::FLI1/full_results_with_logits.csv +3 -0
  20. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/EWSR1::FLI1/predicted_tokens.csv +3 -0
  21. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/EWSR1::FLI1/raw_mutation_results.pkl +3 -0
  22. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/PAX3::FOXO1/conservation_heatmap.png +0 -0
  23. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/PAX3::FOXO1/full_results_with_logits.csv +3 -0
  24. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/PAX3::FOXO1/predicted_tokens.csv +3 -0
  25. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/PAX3::FOXO1/raw_mutation_results.pkl +3 -0
  26. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/TRIM24::RET/conservation_heatmap.png +0 -0
  27. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/TRIM24::RET/full_results_with_logits.csv +3 -0
  28. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/TRIM24::RET/predicted_tokens.csv +3 -0
  29. fuson_plm/benchmarking/mutation_prediction/discovery/results/final/TRIM24::RET/raw_mutation_results.pkl +3 -0
  30. fuson_plm/benchmarking/mutation_prediction/discovery/viridis_color_bar.png +0 -0
  31. fuson_plm/benchmarking/mutation_prediction/recovery/abl_mutations.csv +3 -0
  32. fuson_plm/benchmarking/mutation_prediction/recovery/alk_mutations.csv +3 -0
  33. fuson_plm/benchmarking/mutation_prediction/recovery/color_recovered_mutations_public.ipynb +314 -0
  34. fuson_plm/benchmarking/mutation_prediction/recovery/config.py +3 -0
  35. fuson_plm/benchmarking/mutation_prediction/recovery/recover_public.py +330 -0
  36. fuson_plm/benchmarking/mutation_prediction/recovery/results/final_public/BCR_ABL_mutation_recovery_fuson_mutated_pns_only.csv +3 -0
  37. fuson_plm/benchmarking/mutation_prediction/recovery/results/final_public/EML4_ALK_mutation_recovery_fuson_mutated_pns_only.csv +3 -0
  38. fuson_plm/benchmarking/mutation_prediction/recovery/results/final_public/Supplementary Tables - EML4 ALK Mutations.csv +3 -0
  39. fuson_plm/benchmarking/mutation_prediction/recovery/results/final_public/Supplementary Tables - BCR ABL Mutations.csv +3 -0
  40. fuson_plm/benchmarking/puncta/README.md +1 -1
fuson_plm/benchmarking/caid/README.md CHANGED
@@ -158,8 +158,8 @@ Here we describe what each script does and which files each script creates.
158
  a. Drops rows where no structure was successfully downloaded
159
  b. Drops rows where the FO sequence from FusionPDB does not match the FO sequence from its own AlphaFold2 structure file
160
  c. ⭐️Saves **two final, cleaned databases**⭐️:
161
- a. ⭐️ **`FusionPDB_level2-3_cleaned_FusionGID_info.csv`**: includes ful IDs and structural information for the Hgene and Tgene of each FO. Columns="FusionGID","FusionGene","Hgene","Tgene","URL","HGID","TGID","HGUniProtAcc","TGUniProtAcc","HGUniProtAcc_Source","TGUniProtAcc_Source","HG_pLDDT","HG_AA_pLDDTs","HG_Seq","TG_pLDDT","TG_AA_pLDDTs","TG_Seq".
162
- b. ⭐️ **`FusionPDB_level2-3_cleaned_structure_info.csv`**: includes full structural information for each FO. Columns= "FusionGID","FusionGene","Fusion_Seq","Fusion_Length","Hgene","Hchr","Hbp","Hstrand","Tgene","Tchr","Tbp","Tstrand","Level","Fusion_Structure_Link","Fusion_Structure_Type","Fusion_pLDDT","Fusion_AA_pLDDTs","Fusion_Seq_Source"
163
 
164
 
165
  ### Training
 
158
  a. Drops rows where no structure was successfully downloaded
159
  b. Drops rows where the FO sequence from FusionPDB does not match the FO sequence from its own AlphaFold2 structure file
160
  c. ⭐️Saves **two final, cleaned databases**⭐️:
161
+ a. ⭐️ **`FusionPDB_level2-3_cleaned_FusionGID_info.csv`**: includes full IDs and structural information for the Hgene and Tgene of each FO. Columns = "FusionGID", "FusionGene", "Hgene", "Tgene", "URL", "HGID", "TGID", "HGUniProtAcc", "TGUniProtAcc", "HGUniProtAcc_Source", "TGUniProtAcc_Source", "HG_pLDDT", "HG_AA_pLDDTs", "HG_Seq", "TG_pLDDT", "TG_AA_pLDDTs", "TG_Seq".
162
+ b. ⭐️ **`FusionPDB_level2-3_cleaned_structure_info.csv`**: includes full structural information for each FO. Columns = "FusionGID", "FusionGene", "Fusion_Seq", "Fusion_Length", "Hgene", "Hchr", "Hbp", "Hstrand", "Tgene", "Tchr", "Tbp", "Tstrand", "Level", "Fusion_Structure_Link", "Fusion_Structure_Type", "Fusion_pLDDT", "Fusion_AA_pLDDTs", "Fusion_Seq_Source"
163
 
164
 
165
  ### Training
fuson_plm/benchmarking/idr_prediction/README.md CHANGED
@@ -14,7 +14,7 @@ python plot.py # if you want to remake r2 plots
14
  ```
15
 
16
  ### Downloading raw IDR data
17
- IDR properties from [Lotthammer et al. 2024](https://doi.org/10.1038/s41592-023-02159-5) (ALBATROSS model) were used to train FusOn-pLM-Diso. Sequences were downloaded from [this link](https://github.com/holehouse-lab/supportingdata/blob/master/2023/ALBATROSS_2023/simulations/data/all_sequences.tgz) and deposited in `raw_data`. All files in `raw_data` are from this direct download.
18
 
19
  ```
20
  benchmarking/
@@ -144,7 +144,7 @@ The model is defined in `model.py` and `utils.py`. The `train.py` script trains
144
  - All **raw outputs from models** are stored in `idr_prediction/trained_models/<embedding_path>`, where `embedding_path` represents the embeddings used to build the disorder predictor.
145
  - All **embeddings** made for training will be stored in a new folder called `idr_prediction/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
146
 
147
- Below is the FusOn-pLM-Diso raw outputs folder, `trained_models/fuson_plm/best/`, and the results from the paper, `results/final/`...
148
 
149
  The outputs are structured as follows:
150
 
 
14
  ```
15
 
16
  ### Downloading raw IDR data
17
+ IDR properties from [Lotthammer et al. 2024](https://doi.org/10.1038/s41592-023-02159-5) (ALBATROSS model) were used to train FusOn-pLM-IDR. Sequences were downloaded from [this link](https://github.com/holehouse-lab/supportingdata/blob/master/2023/ALBATROSS_2023/simulations/data/all_sequences.tgz) and deposited in `raw_data`. All files in `raw_data` are from this direct download.
18
 
19
  ```
20
  benchmarking/
 
144
  - All **raw outputs from models** are stored in `idr_prediction/trained_models/<embedding_path>`, where `embedding_path` represents the embeddings used to build the disorder predictor.
145
  - All **embeddings** made for training will be stored in a new folder called `idr_prediction/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
146
 
147
+ Below is the FusOn-pLM-IDR raw outputs folder, `trained_models/fuson_plm/best/`, and the results from the paper, `results/final/`...
148
 
149
  The outputs are structured as follows:
150
 
fuson_plm/benchmarking/mutation_prediction/README.md ADDED
@@ -0,0 +1,114 @@
+ # Mutation Prediction benchmark
+
+ This folder contains all the data and code needed to perform the **mutation prediction benchmark** (Figure 5).
+
+ ## Recovery
+ In Fig. 5C, drug resistance mutations in BCR::ABL and EML4::ALK are recovered by FusOn-pLM. Since the full sequences are not publicly available, certain sections of the code and results files have been removed. All results for drug resistance mutation positions in each sequence are included in `mutation_prediction/recovery`.
+
+ ```
+ benchmarking/
+ └── mutation_prediction/
+     └── recovery/
+         ├── results/final_public/
+         │   ├── BCR_ABL_mutation_recovery_fuson_mutated_pns_only.csv
+         │   ├── Supplementary Tables - BCR ABL Mutations.csv
+         │   ├── EML4_ALK_mutation_recovery_fuson_mutated_pns_only.csv
+         │   └── Supplementary Tables - EML4 ALK Mutations.csv
+         ├── abl_mutations.csv
+         ├── alk_mutations.csv
+         ├── color_recovered_mutations_public.ipynb
+         ├── recover_public.py
+         └── config.py
+ ```
+ In the 📁 **`recovery`** directory:
+ - **`abl_mutations.csv`**: raw data from the literature with BCR::ABL mutations [(O'Hare et al. 2007)](https://doi.org/10.1182/blood-2007-03-066936)
+ - **`alk_mutations.csv`**: raw data from the literature with EML4::ALK mutations [(Elshatlawy et al. 2023)](https://doi.org/10.1002/1878-0261.13446)
+ - **`color_recovered_mutations_public.ipynb`**: notebook used to write PyMOL code for the visualizations in Fig. 5C
+ - **`recover_public.py`**: Python script that runs the analysis. Since the private sequences are removed, the script will not run as-is, but it documents all steps taken.
+ - **`config.py`**: small config file to specify which FusOn-pLM checkpoint is being used.
+
+ In the 📁 **`results/final_public`** directory:
+ - **`BCR_ABL_mutation_recovery_fuson_mutated_pns_only.csv`**: raw logits for every possible mutation at all positions in BCR::ABL with known drug resistance mutations
+ - **`Supplementary Tables - BCR ABL Mutations.csv`**: supplementary table showing the calculations going into the hit rate for BCR::ABL
+ - **`EML4_ALK_mutation_recovery_fuson_mutated_pns_only.csv`**: raw logits for every possible mutation at all positions in EML4::ALK with known drug resistance mutations
+ - **`Supplementary Tables - EML4 ALK Mutations.csv`**: supplementary table showing the calculations going into the hit rate for EML4::ALK
+
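The hit-rate calculation behind these supplementary tables can be sketched as follows (an illustrative helper, not the actual `recover_public.py` code; the positions and residues in the example are made up, not real BCR::ABL data):

```python
def mutation_hit_rate(ranked_predictions, known_mutations, top_n=3):
    """Fraction of known drug-resistance mutations whose substituted
    residue appears among the model's top-n predictions at that position.

    ranked_predictions: dict mapping 1-indexed position -> residues
        ordered from highest to lowest logit.
    known_mutations: iterable of (position, substituted_residue) pairs.
    """
    hits = sum(
        1 for pos, alt in known_mutations
        if alt in ranked_predictions.get(pos, [])[:top_n]
    )
    return hits / len(known_mutations)

# Toy example: two of the three known mutations fall in the top 3
ranked = {315: "IMT", 253: "HFY", 396: "PAG"}
known = [(315, "I"), (253, "Y"), (396, "C")]
print(mutation_hit_rate(ranked, known))  # 2/3
```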
+ ## Discovery
+ The `mutation_prediction/discovery/` directory contains all code and files needed to reproduce Fig. 5B and 5D, where mutations are predicted at each position in several fusion oncoproteins.
+
+ ### Data download
+ To help select TF- and kinase-containing fusions for investigation (Fig. 5B), Supplementary Table 3 from [Salokas et al. 2020](https://doi.org/10.1038/s41598-020-71040-8) was downloaded as a reference of transcription factors and kinases. This data is processed in `clean.py`.
+
+ ```
+ benchmarking/
+ └── mutation_prediction/
+     └── discovery/
+         ├── raw_data/
+         │   └── salokas_2020_tableS3.csv
+         └── processed_data/
+             ├── test_seqs_tftf_kk.csv
+             └── domain_conservation_fusions_inputfile.csv
+ ```
+
+ - **`raw_data/salokas_2020_tableS3.csv`**: Supplementary Table 3 from [Salokas et al. 2020](https://doi.org/10.1038/s41598-020-71040-8)
+ - **`processed_data/test_seqs_tftf_kk.csv`**: fusion oncoproteins in the FusOn-pLM test set that have either a transcription factor (TF) or a kinase as the head or the tail.
+ - **`processed_data/domain_conservation_fusions_inputfile.csv`**: input file for discovery, containing the longest sequences of EWSR1::FLI1, PAX3::FOXO1, TRIM24::RET, and ETV6::NTRK3.
+
+ ### Mutation discovery
+
+ Run `discover.py` to perform discovery on the sequences specified in `config.py` (either one or many):
+
+ ```
+ # Model settings: where is the model you wish to use for mutation discovery?
+ FUSON_PLM_CKPT = "FusOn-pLM"
+
+ #### Fill in this section if you have one input
+ # Sequence settings: need the full sequence of the fusion oncoprotein and the bounds of the region of interest
+ FULL_FUSION_SEQUENCE = ""
+ FUSION_NAME = "fusion_example"
+ START_RESIDUE_INDEX = 1
+ END_RESIDUE_INDEX = 100
+ N = 3 # number of mutations to predict per amino acid
+
+ #### Fill in this section if you have multiple inputs
+ PATH_TO_INPUT_FILE = "processed_data/domain_conservation_fusions_inputfile.csv" # if you don't have an input file and want to do one sequence, set this variable to None
+
+ # GPU Settings: which GPUs should be available to run this discovery?
+ CUDA_VISIBLE_DEVICES = "0"
+ ```
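As a rough illustration, the one-sequence vs. input-file switch described above could be resolved as in this sketch (the `config` attribute names follow the block above; this is not the actual `discover.py` logic):

```python
def load_discovery_inputs(config):
    """Return a list of (name, sequence, start, end, n) discovery jobs.

    Reads the batch CSV when PATH_TO_INPUT_FILE is set; otherwise falls
    back to the single-sequence settings in the config module.
    """
    if config.PATH_TO_INPUT_FILE is not None:
        import pandas as pd  # only needed for the batch branch
        df = pd.read_csv(config.PATH_TO_INPUT_FILE)
        cols = ["fusion_name", "full_fusion_sequence",
                "start_residue_index", "end_residue_index", "n"]
        return list(df[cols].itertuples(index=False, name=None))
    return [(config.FUSION_NAME, config.FULL_FUSION_SEQUENCE,
             config.START_RESIDUE_INDEX, config.END_RESIDUE_INDEX, config.N)]
```

With `PATH_TO_INPUT_FILE = None`, this yields a single job built from the `FULL_FUSION_SEQUENCE` settings.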
+
+ To run, use:
+ ```
+ nohup python discover.py > discover.out 2> discover.err &
+ ```
+ - All **results** are stored in `discovery/results/<timestamp>`, where `timestamp` is a unique string encoding the date and time when you started the discovery run.
+
+ Below are the FusOn-pLM paper results in `results/final`:
+
+ ```
+ benchmarking/
+ └── mutation_prediction/
+     └── discovery/
+         └── results/final/
+             ├── EWSR1::FLI1/
+             │   ├── conservation_heatmap.png
+             │   ├── full_results_with_logits.csv
+             │   ├── predicted_tokens.csv
+             │   └── raw_mutation_results.pkl
+             ├── PAX3::FOXO1/   # same format as results/final/EWSR1::FLI1...
+             ├── TRIM24::RET/   # same format as results/final/EWSR1::FLI1...
+             └── ETV6::NTRK3/   # same format as results/final/EWSR1::FLI1...
+ ```
+ In each fusion oncoprotein folder are:
+ - **`conservation_heatmap.png`**: the heatmap from Fig. 5B
+ - **`full_results_with_logits.csv`**: predictions for every residue in the sequence. Columns = "Residue", "original_residue", "original_residue_logit", "all_logits", "top_3_mutations"
+ - **`predicted_tokens.csv`**: simplified format, top three tokens per residue and a 1/0 conserved/not conserved label. Columns = "Original Residue", "Predicted Residues", "Conserved", "Position"
+ - **`raw_mutation_results.pkl`**: raw logits in dictionary format.
+
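A `predicted_tokens.csv`-style summary can be derived from a per-position logit matrix roughly like this (a sketch assuming a 20-column amino-acid logit matrix in the order shown; not the exact pipeline code):

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # assumed column order

def summarize_predictions(logit_matrix, original_seq, n=3):
    """Build predicted_tokens.csv-style rows: the top-n residues per
    position and a 1/0 flag for whether the original residue is among
    them (i.e. "conserved")."""
    rows = []
    for pos, (logits, orig) in enumerate(zip(logit_matrix, original_seq), start=1):
        order = np.argsort(logits)[::-1][:n]      # indices of top-n logits
        top = [AMINO_ACIDS[i] for i in order]
        rows.append({"Position": pos,
                     "Original Residue": orig,
                     "Predicted Residues": "".join(top),
                     "Conserved": int(orig in top)})
    return rows
```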
+ ### Plotting
+
+ Three scripts aid with plotting:
+
+ 1. `make_color_bar.py`: can be run to generate `viridis_color_bar.png`, a labeled, scaled-up version of the color bar used in the conservation heatmaps and to color an ETV6::NTRK3 structure (Fig. 5D)
+ 2. `plot.py`: includes code for heatmap generation. Can be run to regenerate heatmaps.
+ 3. `color_discovered_mutations.ipynb`: notebook used to generate a modified version of the ETV6::NTRK3 structure file, which stores the logits predicted by FusOn-pLM as the B-factor (`processed_data/521_logit_bfactor.cif`, displayed in Fig. 5D, right). It also includes PyMOL code for a head/tail visualization with recovered drug resistance mutations (displayed in Fig. 5D, left)
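For illustration, writing per-residue scores into the B-factor field is straightforward on PDB-format ATOM records (the notebook itself edits an mmCIF file; this simplified sketch uses the fixed-column PDB layout instead, and the residue-number keys are hypothetical):

```python
def set_bfactors(pdb_lines, b_by_resi):
    """Overwrite the B-factor field (columns 61-66) of each ATOM/HETATM
    record with a per-residue value keyed by the residue sequence number
    (columns 23-26). Records without a mapped value are left unchanged."""
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            resi = int(line[22:26])
            b = b_by_resi.get(resi)
            if b is not None:
                line = line[:60] + f"{b:6.2f}" + line[66:]
        out.append(line)
    return out
```

A structure rewritten this way can then be colored in PyMOL with `spectrum b`, as in the notebook.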
fuson_plm/benchmarking/mutation_prediction/discovery/clean.py ADDED
@@ -0,0 +1,71 @@
+ ### Clean the Salokas data, find TF and Kinase fusions in the test set
+ import pandas as pd
+ import os
+
+ def get_gene_type(gene, d):
+     # genes absent from the TF/kinase dictionary fall through to 'Other'
+     if gene in d:
+         if d[gene] == 'kinase':
+             return 'Kinase'
+         if d[gene] == 'tf':
+             return 'TF'
+     return 'Other'
+
+ # Load TF and Kinase Fusions
+ def main():
+     os.makedirs("processed_data", exist_ok=True)
+
+     tf_kinase_parts = pd.read_csv("raw_data/salokas_2020_tableS3.csv")
+     print(tf_kinase_parts)
+     ht_tf_kinase_dict = dict(zip(tf_kinase_parts['Gene'], tf_kinase_parts['Kinase or TF']))
+
+     ## Categorize everything in fuson_db
+     fuson_db = pd.read_csv("../../../data/fuson_db.csv")
+     print(fuson_db['benchmark'].value_counts())
+     print(fuson_db.loc[fuson_db['benchmark'].notna()])
+     fgenes = fuson_db.loc[fuson_db['benchmark'].notna()]['fusiongenes'].to_list()
+     print(fuson_db.columns)
+     print(fuson_db)
+
+     # This one has each row with one fusiongene name
+     fuson_ht_db = pd.read_csv("../../../data/blast/fuson_ht_db.csv")
+     print(fuson_ht_db.columns)
+     print(fuson_ht_db)
+     fuson_ht_db[['hg','tg']] = fuson_ht_db['fusiongenes'].str.split("::", expand=True)
+     print(fuson_ht_db.loc[fuson_ht_db['hg']=='PAX3'])
+     print(fuson_ht_db)
+
+     fuson_ht_db['hg_type'] = fuson_ht_db['hg'].apply(lambda x: get_gene_type(x, ht_tf_kinase_dict))
+     fuson_ht_db['tg_type'] = fuson_ht_db['tg'].apply(lambda x: get_gene_type(x, ht_tf_kinase_dict))
+     fuson_ht_db['fusion_type'] = fuson_ht_db['hg_type']+'::'+fuson_ht_db['tg_type']
+     fuson_ht_db['type'] = ['fusion']*len(fuson_ht_db)
+
+     # Keep things in the test set
+     test_set = pd.read_csv("../../../data/splits/test_df.csv")
+     print(test_set.columns, len(test_set))
+     test_seqs = test_set['sequence'].tolist()
+     fuson_ht_db = fuson_ht_db.loc[
+         fuson_ht_db['aa_seq'].isin(test_seqs)
+     ].sort_values(by=['fusion_type']).reset_index(drop=True)
+     fuson_ht_db.to_csv("processed_data/test_seqs_tftf_kk.csv", index=False)
+
+     # isolate a few transcription factor fusions of interest and keep the longest sequence of each
+     fusion_genes_of_interest = [
+         "EWSR1::FLI1", "PAX3::FOXO1", "TRIM24::RET", "ETV6::NTRK3"
+     ]
+     df_of_interest = fuson_ht_db.loc[
+         fuson_ht_db['fusiongenes'].isin(fusion_genes_of_interest)
+     ].sort_values(by=['fusiongenes','length'], ascending=[True,False]).reset_index(drop=True).drop_duplicates(subset='fusiongenes').reset_index(drop=True)
+     #df_of_interest.to_csv("domain_conservation_fusions.csv", index=False)
+     # Make an input file for discover.py
+     discovery_input = df_of_interest[['fusiongenes','length','aa_seq']].copy()
+     discovery_input['start_residue_index'] = [1]*len(discovery_input)
+     discovery_input['n'] = [3]*len(discovery_input)
+     discovery_input = discovery_input.rename(columns={'length': 'end_residue_index',
+                                                       'aa_seq': 'full_fusion_sequence',
+                                                       'fusiongenes': 'fusion_name'})
+     discovery_input[['fusion_name','full_fusion_sequence','start_residue_index','end_residue_index','n']].to_csv("processed_data/domain_conservation_fusions_inputfile.csv", index=False)
+     print(discovery_input)
+
+ if __name__ == "__main__":
+     main()
fuson_plm/benchmarking/mutation_prediction/discovery/color_discovered_mutations.ipynb ADDED
@@ -0,0 +1,418 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "FJd6a9gdZNjG"
7
+ },
8
+ "source": [
9
+ "### Imports"
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": null,
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "!pip install torch pandas numpy py3Dmol scikit-learn"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": 37,
24
+ "metadata": {
25
+ "id": "ZEWZVc9lUxjI"
26
+ },
27
+ "outputs": [],
28
+ "source": [
29
+ "import torch\n",
30
+ "import torch.nn as nn\n",
31
+ "\n",
32
+ "import pickle\n",
33
+ "import pandas as pd\n",
34
+ "import numpy as np\n",
35
+ "\n",
36
+ "import py3Dmol\n",
37
+ "\n",
38
+ "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, average_precision_score"
39
+ ]
40
+ },
41
+ {
42
+ "cell_type": "code",
43
+ "execution_count": null,
44
+ "metadata": {},
45
+ "outputs": [],
46
+ "source": [
47
+ "import os\n",
48
+ "os.getcwd()"
49
+ ]
50
+ },
51
+ {
52
+ "cell_type": "markdown",
53
+ "metadata": {},
54
+ "source": [
55
+ "# Look at all results for ETV6::NTRK3 discovery"
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "code",
60
+ "execution_count": null,
61
+ "metadata": {},
62
+ "outputs": [],
63
+ "source": [
64
+ "import pickle\n",
65
+ "path_to_pkl = \"../discovery/results/final/ETV6::NTRK3/raw_mutation_results.pkl\"\n",
66
+ "with open(path_to_pkl, \"rb\") as f:\n",
67
+ " etv6_ntrk3_logits = pickle.load(f)\n",
68
+ "\n",
69
+ "print(etv6_ntrk3_logits)"
70
+ ]
71
+ },
72
+ {
73
+ "cell_type": "code",
74
+ "execution_count": 40,
75
+ "metadata": {},
76
+ "outputs": [],
77
+ "source": [
78
+ "# Define paths and dataframes that we will need \n",
79
+ "fusion_benchmark_set = pd.read_csv('../../caid/splits/fusion_bench_df.csv')\n",
80
+ "fusion_structure_data = pd.read_csv('../../caid/processed_data/fusionpdb/FusionPDB_level2-3_cleaned_structure_info.csv')\n",
81
+ "fusion_structure_data['Fusion_Structure_Link'] = fusion_structure_data['Fusion_Structure_Link'].apply(lambda x: x.split('/')[-1])\n",
82
+ "fusion_structure_folder = \"raw_data/fusionpdb/structures\""
83
+ ]
84
+ },
85
+ {
86
+ "cell_type": "code",
87
+ "execution_count": null,
88
+ "metadata": {},
89
+ "outputs": [],
90
+ "source": [
91
+ "# merge fusion data with seq ids \n",
92
+ "fuson_db = pd.read_csv('../../../data/fuson_db.csv')\n",
93
+ "print(fuson_db.columns)\n",
94
+ "fuson_db = fuson_db[['aa_seq','seq_id']].rename(columns={'aa_seq':'Fusion_Seq'})\n",
95
+ "print(f\"Length of fusion structure data before merge on seqid: {len(fusion_structure_data)}\")\n",
96
+ "fusion_structure_data = pd.merge(\n",
97
+ " fusion_structure_data,\n",
98
+ " fuson_db,\n",
99
+ " on='Fusion_Seq',\n",
100
+ " how='inner'\n",
101
+ ")\n",
102
+ "print(f\"Length of fusion structure data after merge on seqid: {len(fusion_structure_data)}\")\n",
103
+ "fusion_structure_data"
104
+ ]
105
+ },
106
+ {
107
+ "cell_type": "code",
108
+ "execution_count": null,
109
+ "metadata": {},
110
+ "outputs": [],
111
+ "source": [
112
+ "# merge fusion structure data with top swissprot alignments\n",
113
+ "swissprot_top_alignments = pd.read_csv(\"../../../data/blast/blast_outputs/swissprot_top_alignments.csv\")\n",
114
+ "fusion_structure_data = pd.merge(\n",
115
+ " fusion_structure_data,\n",
116
+ " swissprot_top_alignments,\n",
117
+ " on=\"seq_id\",\n",
118
+ " how=\"left\"\n",
119
+ ")\n",
120
+ "fusion_structure_data.head()"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "code",
125
+ "execution_count": null,
126
+ "metadata": {},
127
+ "outputs": [],
128
+ "source": [
129
+ "selected_row = fusion_structure_data.loc[\n",
130
+ " (fusion_structure_data['FusionGene'].str.contains('ETV6::NTRK3')) &\n",
131
+ " (fusion_structure_data['aa_seq_len']==528)\n",
132
+ "].reset_index(drop=True).iloc[0,:]\n",
133
+ "selected_row"
134
+ ]
135
+ },
136
+ {
137
+ "cell_type": "code",
138
+ "execution_count": null,
139
+ "metadata": {},
140
+ "outputs": [],
141
+ "source": [
142
+ "seq = selected_row[\"Fusion_Seq\"]\n",
143
+ "kinase_seq = \"IVLKRELGEGAFGKVFLAECYNLSPTKDKMLVAVKALKDPTLAARKDFQREAELLTNLQHEHIVKFYGVCGDGDPLIMVFEYMKHGDLNKFLRAHGPDAMILVDGQPRQAKGELGLSQMLHIASQIASGMVYLASQHFVHRDLATRNCLVGANLLVKIGDFGMSRDVYSTDYYRLFNPSGNDFCIWCEVGGHTMLPIRWMPPESIMYRKFTTESDVWSFGVILWEIFTYGKQPWFQLSNTEVIECITQGRVLERPRVCPKEVYDVMLGCWQREPQQRLNIKEIYKILHALGKATPIYLDILG\"\n",
144
+ "print(\"Length of kinase domain: \", len(kinase_seq))\n",
145
+ "print(\"1-indexed start position of kinase domain\", seq.index(kinase_seq)-1)\n",
146
+ "print(\"1-indexed end position of kinase domain (inclusive)\", seq.index(kinase_seq)-1+len(kinase_seq)-1)\n"
147
+ ]
148
+ },
149
+ {
150
+ "cell_type": "code",
151
+ "execution_count": 45,
152
+ "metadata": {},
153
+ "outputs": [],
154
+ "source": [
155
+ "kinase_seq = \"IVLKRELGEGAFGKVFLAECYNLSPTKDKMLVAVKALKDPTLAARKDFQREAELLTNLQHEHIVKFYGVCGDGDPLIMVFEYMKHGDLNKFLRAHGPDAMILVDGQPRQAKGELGLSQMLHIASQIASGMVYLASQHFVHRDLATRNCLVGANLLVKIGDFGMSRDVYSTDYYRLFNPSGNDFCIWCEVGGHTMLPIRWMPPESIMYRKFTTESDVWSFGVILWEIFTYGKQPWFQLSNTEVIECITQGRVLERPRVCPKEVYDVMLGCWQREPQQRLNIKEIYKILHALGKATPIYLDILG\""
156
+ ]
157
+ },
158
+ {
159
+ "cell_type": "code",
160
+ "execution_count": null,
161
+ "metadata": {},
162
+ "outputs": [],
163
+ "source": [
164
+ "!pip install transformers"
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "code",
169
+ "execution_count": 47,
170
+ "metadata": {},
171
+ "outputs": [],
172
+ "source": [
173
+ "from transformers import AutoTokenizer\n",
174
+ "import torch.nn.functional as F\n",
175
+ "fuson_ckpt_path = \"../../../..\"\n",
176
+ "fuson_tokenizer = AutoTokenizer.from_pretrained(fuson_ckpt_path)"
177
+ ]
178
+ },
179
+ {
180
+ "cell_type": "code",
181
+ "execution_count": null,
182
+ "metadata": {},
183
+ "outputs": [],
184
+ "source": [
185
+ "print(len(etv6_ntrk3_logits['originals_logits']))\n",
186
+ "print(len(etv6_ntrk3_logits['conservation_likelihoods']))\n",
187
+ "print(len(etv6_ntrk3_logits['logits_for_each_AA']))\n",
188
+ "\n",
189
+ "start = etv6_ntrk3_logits['start']\n",
190
+ "end = etv6_ntrk3_logits['end']\n",
191
+ "originals_logits = etv6_ntrk3_logits['originals_logits']\n",
192
+ "conservation_likelihoods = etv6_ntrk3_logits['conservation_likelihoods']\n",
193
+ "logits = etv6_ntrk3_logits['logits']\n",
194
+ "logits_for_each_AA = etv6_ntrk3_logits['logits_for_each_AA']\n",
195
+ "filtered_indices = etv6_ntrk3_logits['filtered_indices']\n",
196
+ "top_n_mutations = etv6_ntrk3_logits['top_n_mutations']\n",
197
+ "\n",
198
+ "token_indices = torch.arange(logits.size(-1))\n",
199
+ "tokens = [fuson_tokenizer.decode([idx]) for idx in token_indices]\n",
200
+ "filtered_tokens = [tokens[i] for i in filtered_indices]\n",
201
+ "all_logits_array = np.vstack(logits_for_each_AA)\n",
202
+ "normalized_logits_array = F.softmax(torch.tensor(all_logits_array), dim=-1).numpy()\n",
203
+ "transposed_logits_array = normalized_logits_array.T"
204
+ ]
205
+ },
206
+ {
207
+ "cell_type": "code",
208
+ "execution_count": null,
209
+ "metadata": {},
210
+ "outputs": [],
211
+ "source": [
212
+ "# add these logits as a b factor to the pdb file \n",
213
+ "import re\n",
214
+ "\n",
215
+ "path_to_cif = f\"../../caid/raw_data/fusionpdb/structures/{selected_row['Fusion_Structure_Link']}\"\n",
216
+ "path_to_modified_cif = f\"{selected_row['Fusion_Structure_Link'].split('.')[0]}_logit_bfactor.cif\"\n",
217
+ "\n",
218
+ "def modify(path_to_cif, path_to_modified_cif, new_b_values):\n",
219
+ " with open(path_to_cif, 'r') as f:\n",
220
+ " lines = f.readlines()\n",
221
+ "\n",
222
+ " exline = ''\n",
223
+ " with open(path_to_modified_cif, 'w') as f:\n",
224
+ " in_aa_loop=False\n",
225
+ " in_atom_loop=False\n",
226
+ " done_aa=False\n",
227
+ " done_atom=False\n",
228
+ " counter_aa=0\n",
229
+ " counter_atom=0\n",
230
+ " seqlen=len(selected_row['Fusion_Seq'])\n",
231
+ " for line in lines:\n",
232
+ " newline=line\n",
233
+ " if line.startswith(\"A\\t\") and not(done_aa):\n",
234
+ " in_aa_loop=True\n",
235
+ " else:\n",
236
+ " in_aa_loop=False\n",
237
+ " if line.startswith(\"ATOM \") and not(done_atom):\n",
238
+ " in_atom_loop=True\n",
239
+ " else:\n",
240
+ " in_atom_loop=False\n",
241
+ " \n",
242
+ " if in_aa_loop:\n",
243
+ " counter_aa+=1\n",
244
+ " split_line = line.split()\n",
245
+ " one_indexed_pos = int(split_line[2])\n",
246
+ " new_value = new_b_values[one_indexed_pos]\n",
247
+ " new_value = round(new_value,4)\n",
248
+ " split_line[4] = str(new_value)\n",
249
+ " newline = '\\t'.join(split_line)+'\\n'\n",
250
+ " \n",
251
+ " if in_atom_loop:\n",
252
+ " counter_atom+=1\n",
253
+ " split_line = re.split(r'(\\s+)', line)\n",
254
+ " one_indexed_pos = int(split_line[16])\n",
255
+ " new_value = new_b_values[one_indexed_pos]\n",
256
+ " new_value = round(new_value,4)\n",
257
+ " split_line[28] = str(new_value)\n",
258
+ " newline = ''.join(split_line)\n",
259
+ " #print(split_line)\n",
260
+ " \n",
261
+ " if counter_aa==seqlen:\n",
262
+ " done_aa=True\n",
263
+ " in_aa_loop=False\n",
264
+ " \n",
265
+ " f.write(newline)\n"
266
+ ]
267
+ },
268
+ {
269
+ "cell_type": "code",
270
+ "execution_count": null,
271
+ "metadata": {},
272
+ "outputs": [],
273
+ "source": [
274
+ "# read ETV6::NTRK3 predictions\n",
275
+ "path_to_preds = \"../discovery/results/final/ETV6::NTRK3/full_results_with_logits.csv\"\n",
276
+ "preds = pd.read_csv(path_to_preds)\n",
277
+ "original_seq = ''.join(preds['original_residue'].tolist())\n",
278
+ "print(original_seq)\n",
279
+ "\n",
280
+ "# what is the structural data for this\n",
281
+ "structure_data = fusion_structure_data.loc[\n",
282
+ " fusion_structure_data['Fusion_Seq']==original_seq\n",
283
+ "].reset_index(drop=True).loc[0]\n",
284
+ "print(structure_data.to_string())\n",
285
+ "print(structure_data['top_hg_UniProt_fus_indices'], structure_data['top_tg_UniProt_fus_indices'])\n",
286
+ "print(structure_data['Fusion_Structure_Link'])\n"
287
+ ]
288
+ },
289
+ {
290
+ "cell_type": "code",
291
+ "execution_count": null,
292
+ "metadata": {},
293
+ "outputs": [],
294
+ "source": [
295
+ "path_to_cif = f\"../../caid/raw_data/fusionpdb/structures/{structure_data['Fusion_Structure_Link']}\"\n",
296
+ "path_to_modified_cif = f\"processed_data/{structure_data['Fusion_Structure_Link'].split('.')[0]}_logit_bfactor.cif\"\n",
297
+ "\n",
298
+ "new_b_values = dict(zip(preds[\"Residue\"],preds[\"original_residue_logit\"]))\n",
299
+ "\n",
300
+ "modify(path_to_cif, path_to_modified_cif, new_b_values)"
301
+ ]
302
+ },
303
+ {
304
+ "cell_type": "code",
305
+ "execution_count": null,
306
+ "metadata": {},
307
+ "outputs": [],
308
+ "source": [
309
+ "# spectrum b to color by logits\n",
310
+ "spectrum b\n",
311
+ "\n",
312
+ "sele head, resi 1-338\n",
313
+ "sele tail, resi 339-647\n",
314
+ "sele kinase, resi 346-647\n",
315
+ "set cartoon_transparency, 0.8, tail\n",
316
+ "set cartoon_transparency, 0.8, kinase\n",
317
+ "sele G623R, resi 431\n",
318
+ "sele G696A, resi 504\n",
319
+ "color orange, G623R\n",
320
+ "set cartoon_transparency, 0, G623R\n",
321
+ "show licorice, G623R\n",
322
+ "color orange, G696A\n",
323
+ "set cartoon_transparency, 0, G696A\n",
324
+ "show licorice, G696A"
325
+ ]
326
+ },
327
+ {
328
+ "cell_type": "code",
329
+ "execution_count": null,
330
+ "metadata": {},
331
+ "outputs": [],
332
+ "source": [
333
+ "# PYMOL code for ETV6::NTRK3\n",
334
+ "\n",
335
+ "\n",
336
+ "# Define a custom color for the kinase domain\n",
337
+ "set_color custom_red, [0xdf/255, 0x83/255, 0x85/255]\n",
338
+ "set_color custom_blue, [0x6e/255, 0xa4/255, 0xda/255]\n",
339
+ "\n",
340
+ "# Select and color residues 1085-1336\n",
341
+ "sele head, resi 1-338\n",
342
+ "sele tail, resi 339-647\n",
343
+ "sele kinase, resi 346-647\n",
344
+ "color custom_red, head\n",
345
+ "color custom_blue, tail\n",
346
+ "set cartoon_transparency, 0, tail\n",
347
+ "set cartoon_transparency, 0.8, kinase\n",
348
+ "sele G623R, resi 431\n",
349
+ "sele G696A, resi 504\n",
350
+ "color magenta, G623R\n",
351
+ "set cartoon_transparency, 0, G623R\n",
352
+ "show licorice, G623R\n",
353
+ "color magenta, G696A\n",
354
+ "set cartoon_transparency, 0, G696A\n",
355
+ "show licorice, G696A\n",
356
+ "\n",
357
+ "# Color the known mutations that impact drug efficacy\n",
358
+ "\n",
359
+ "# Select missed mutation residues, color them viridis, and make them fully opaque\n",
360
+ "# do the viridis outside the "
361
+ ]
362
+ },
363
+ {
364
+ "cell_type": "code",
365
+ "execution_count": null,
366
+ "metadata": {},
367
+ "outputs": [],
368
+ "source": [
369
+ "# PYMOL code for ETV6::NTRK3\n",
370
+ "# Set global cartoon transparency\n",
371
+ "# 0 for zoomed out, 0.5 for close up\n",
372
+ "color gray60\n",
373
+ "set cartoon_transparency, 0\n",
374
+ "\n",
375
+ "# Define a custom color for the kinase domain\n",
376
+ "set_color custom_blue, [0x6e/255, 0xa4/255, 0xda/255]\n",
377
+ "\n",
378
+ "# Select and color residues 1085-1336\n",
379
+ "sele ntrk3_kinase, resi 225-526\n",
380
+ "color custom_blue, ntrk3_kinase\n",
381
+ "\n",
382
+ "# Select missed mutation residues, color them orange, and make them fully opaque\n",
383
+ "spectrum b, blue_white_red, minimum=0.8, maximum=1"
384
+ ]
385
+ }
386
+ ],
387
+ "metadata": {
388
+ "colab": {
389
+ "collapsed_sections": [
390
+ "FJd6a9gdZNjG",
391
+ "zORkLJztZWp9",
392
+ "w25hagtZaV65",
393
+ "IbyqxlvAFUAK",
394
+ "0n5PSprbhLk7"
395
+ ],
396
+ "machine_shape": "hm",
397
+ "provenance": []
398
+ },
399
+ "kernelspec": {
400
+ "display_name": "Python 3",
401
+ "name": "python3"
402
+ },
403
+ "language_info": {
404
+ "codemirror_mode": {
405
+ "name": "ipython",
406
+ "version": 3
407
+ },
408
+ "file_extension": ".py",
409
+ "mimetype": "text/x-python",
410
+ "name": "python",
411
+ "nbconvert_exporter": "python",
412
+ "pygments_lexer": "ipython3",
413
+ "version": "3.10.12"
414
+ }
415
+ },
416
+ "nbformat": 4,
417
+ "nbformat_minor": 0
418
+ }
fuson_plm/benchmarking/mutation_prediction/discovery/config.py ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Model settings: where is the model you wish to use for mutation discovery?
2
+ FUSON_PLM_CKPT = "FusOn-pLM"
3
+
4
+ #### Fill in this section if you have one input
5
+ # Sequence settings: need full sequence of fusion oncoprotein, and the bounds of region of interest
6
+ FULL_FUSION_SEQUENCE = ""
7
+ FUSION_NAME = "fusion_example"
8
+ START_RESIDUE_INDEX = 1
9
+ END_RESIDUE_INDEX = 100
10
+ N = 3 # number of mutations to predict per amino acid
11
+
12
+ #### Fill in this section if you have multiple inputs
13
+ PATH_TO_INPUT_FILE = "processed_data/domain_conservation_fusions_inputfile.csv" # if you don't have an input file and want to do one sequence, set this variable to None
14
+
15
+ # GPU Settings: which GPUs should be available to run this discovery?
16
+ CUDA_VISIBLE_DEVICES = "0"
fuson_plm/benchmarking/mutation_prediction/discovery/discover.py ADDED
@@ -0,0 +1,346 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ##### Discover mutations in new sequences: mask each position and rank the model's predicted residues
2
+ import fuson_plm.benchmarking.mutation_prediction.discovery.config as config
3
+ import os
4
+ import pickle
5
+ os.environ['CUDA_VISIBLE_DEVICES'] = config.CUDA_VISIBLE_DEVICES
6
+
7
+ import pandas as pd
8
+ import numpy as np
9
+ import transformers
10
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
11
+ import logging
12
+ import torch
13
+ import matplotlib.pyplot as plt
14
+ import seaborn as sns
15
+ import torch.nn.functional as F
16
+
17
+ from fuson_plm.utils.logging import open_logfile, log_update, get_local_time, print_configpy
18
+ from fuson_plm.utils.embedding import load_esm2_type
19
+ from fuson_plm.utils.visualizing import set_font
20
+ from fuson_plm.benchmarking.mutation_prediction.recovery.recover import check_env_variables
21
+ from fuson_plm.benchmarking.mutation_prediction.discovery.plot import plot_conservation_heatmap, plot_full_heatmap
22
+
23
+ def check_seq_inputs(sequence, AAs_tokens):
24
+ # check the sequence input for validity; raise on the first problem found
25
+ if not sequence.strip():
26
+ raise Exception("Error: The sequence input is empty. Please enter a valid protein sequence.")
27
+ if any(char not in AAs_tokens for char in sequence):
28
+ raise Exception("Error: The sequence input contains non-amino acid characters. Please enter a valid protein sequence.")
31
+
32
+ def check_domain_bounds(domain_bounds, sequence=None):
33
+ # validate the bounds before returning them; every raise must come before the return
34
+ try:
35
+ start = int(domain_bounds['start'])
36
+ end = int(domain_bounds['end'])
37
+ except ValueError:
38
+ raise Exception("Error: Start and end indices must be integers.")
39
+ if start >= end:
40
+ raise Exception("Start index must be smaller than end index.")
41
+ if start <= 0 or end <= 0:
42
+ raise Exception("Indexing starts at 1; domain bounds must be positive integers.")
43
+ if sequence is not None and (start > len(sequence) or end > len(sequence)):
44
+ raise Exception("Domain bounds exceed sequence length.")
45
+ return start, end
52
+
53
+ def check_n_input(n):
54
+ # N is the number of mutations predicted per position; it must be at least 1
55
+ if n < 1:
56
+ raise Exception("Choose N >= 1")
57
+
58
+ def predict_positionwise_mutations(sequence, domain_bounds, n, model, tokenizer, device):
59
+ # Define amino acids and their token indices
60
+ AAs_tokens = ['L', 'A', 'G', 'V', 'S', 'E', 'R', 'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C']
61
+ AAs_tokens_indices = {'L' : 4, 'A' : 5, 'G' : 6, 'V': 7, 'S' : 8, 'E' : 9, 'R' : 10, 'T' : 11, 'I': 12, 'D' : 13, 'P' : 14,
62
+ 'K' : 15, 'Q' : 16, 'N' : 17, 'F' : 18, 'Y' : 19, 'M' : 20, 'H' : 21, 'W' : 22, 'C' : 23}
63
+
64
+ # checking all inputs for validity
65
+ log_update("\nChecking validity of sequence input, domain bounds, and N mutations")
66
+ check_seq_inputs(sequence, AAs_tokens)
67
+ start, end = check_domain_bounds(domain_bounds)
68
+ check_n_input(n)
69
+
70
+ # define start_index as start - 1 (because residues are 1-indexed, while Python is 0-indexed). end is same
71
+ start_index = start - 1
72
+ end_index = end
73
+
74
+ # place to store top n mutations and all logits
75
+ top_n_mutations = {}
76
+ top_n_advantage_mutations = {}
77
+ top_n_disadvantage_mutations = {}
78
+ logits_for_each_AA = []
79
+ llrs_for_each_AA = []
80
+
81
+ # storage for the conservation heatmap
82
+ originals_logits = []
83
+ conservation_likelihoods = {}
84
+
85
+ log_update("\nCalculating mutations. Printing currently masked position and mutation results.")
86
+ for i in range(len(sequence)):
87
+ # only iterate through the residues inside the domain
88
+ if start_index <= i <= (end_index - 1):
89
+ # isolate original residue and its index
90
+ original_residue = sequence[i]
91
+ original_residue_index = AAs_tokens_indices[original_residue]
92
+ masked_seq = sequence[:i] + '<mask>' + sequence[i+1:]
93
+
94
+ # prepare log
95
+ masked_seq_list = list(sequence[:i]) + ['<mask>'] + list(sequence[i+1:])
96
+ log_starti = i-min(5, i)
97
+ log_endi = i+5
98
+ log_update(f"\t{i+1}: residue = {original_residue}, masked sequence preview (pos {log_starti+1}-{log_endi}) = {''.join(masked_seq_list[log_starti:log_endi])}")
99
+ # prepare inputs
100
+ inputs = tokenizer(masked_seq, return_tensors="pt", padding=True, truncation=True, max_length=len(masked_seq)+2)
101
+ inputs = {k: v.to(device) for k, v in inputs.items()}
102
+ # forward pass
103
+ with torch.no_grad():
104
+ logits = model(**inputs).logits
105
+
106
+ # Find masked positions and extract their logits
107
+ mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
108
+ mask_token_logits = logits[0, mask_token_index, :] # shape: [1, vocab_size] == [1, 33]. logits for each vocab word at this position in the sequence
109
+
110
+ # Collect logits for the full heatmap
111
+ logits_array = mask_token_logits.cpu().numpy() # shape: [1, 33]
112
+ # filter out non-amino acid tokens
113
+ filtered_indices = list(range(4, 23 + 1)) # filtered indices are indices of amino acids
114
+ filtered_logits = logits_array[:, filtered_indices] # shape: [1, 20] only the 20 amino acids
115
+ logits_for_each_AA.append(filtered_logits) # get logits for each amino acid
116
+
117
+ # Collect LLRs for the LLR heatmap
118
+ log_probabilities = F.log_softmax(mask_token_logits.cpu(), dim=-1).squeeze(0) # take log softmax of the [33] dimension; mask_token_logits is already a tensor
119
+ log_prob_og = log_probabilities[original_residue_index] # get the log probability of the TRUE residue underneath the mask
120
+ llrs = log_probabilities - log_prob_og # calculate the LLR for every token in one vectorized subtraction
121
+ #print(original_residue_index, llrs)
122
+ filtered_llrs = llrs[filtered_indices].numpy() # filter so it's [20], just the amino acids; only save this
123
+ filtered_llrs_array = np.array([filtered_llrs])
124
+ llrs_for_each_AA.append(filtered_llrs_array)
125
+
126
+ ######### Top tokens
127
+ # Get top tokens based on LOGITS
128
+ all_tokens_logits = mask_token_logits.squeeze(0) # shape: [vocab_size] == [33]
129
+ top_tokens_indices = torch.argsort(all_tokens_logits, dim=0, descending=True) # sort the logits
130
+ mutation = []
131
+ # make sure we don't include non-AA tokens
132
+ for token_index in top_tokens_indices:
133
+ decoded_token = tokenizer.decode([token_index.item()])
134
+ # decoded all tokens, pick the top n amino acid ones
135
+ if decoded_token in AAs_tokens:
136
+ mutation.append(decoded_token)
137
+ if len(mutation) == n:
138
+ break
139
+ top_n_mutations[(sequence[i], i)] = mutation
140
+ log_update(f"\t\ttop {n} predicted AAs: {','.join(mutation)}")
141
+
142
+ # Get top tokens based on LLR
143
+ top_advantage_tokens_indices = torch.argsort(llrs, dim=0, descending=True) # sort the LLRs
144
+ advantage_mutation = []
145
+ # make sure we don't include non-AA tokens
146
+ for token_index in top_advantage_tokens_indices:
147
+ decoded_token = tokenizer.decode([token_index.item()])
148
+ # decoded all tokens, pick the top n amino acid ones
149
+ if decoded_token in AAs_tokens:
150
+ advantage_mutation.append(decoded_token)
151
+ if len(advantage_mutation) == n:
152
+ break
153
+ top_n_advantage_mutations[(sequence[i], i)] = advantage_mutation
154
+ log_update(f"\t\ttop {n} predicted advantageous mutations: {','.join(advantage_mutation)}")
155
+
156
+ # Get bottom tokens based on LLR (most disadvantageous mutations)
157
+ top_disadvantage_tokens_indices = torch.argsort(llrs, dim=0, descending=False) # sort the LLRs
158
+ disadvantage_mutation = []
159
+ # make sure we don't include non-AA tokens
160
+ for token_index in top_disadvantage_tokens_indices:
161
+ decoded_token = tokenizer.decode([token_index.item()])
162
+ # decoded all tokens, pick the top n amino acid ones
163
+ if decoded_token in AAs_tokens:
164
+ disadvantage_mutation.append(decoded_token)
165
+ if len(disadvantage_mutation) == n:
166
+ break
167
+ top_n_disadvantage_mutations[(sequence[i], i)] = disadvantage_mutation
168
+ log_update(f"\t\ttop {n} predicted disadvantageous mutations: {','.join(disadvantage_mutation)}")
169
+
170
+ # fill in the logits and conservation likelihoods for the second array
171
+ normalized_mask_token_logits = F.softmax(mask_token_logits.cpu(), dim=-1).numpy()
172
+ normalized_mask_token_logits = np.squeeze(normalized_mask_token_logits)
173
+ originals_logit = normalized_mask_token_logits[original_residue_index]
174
+ originals_logits.append(originals_logit)
175
+
176
+ # a position is conserved if the probability of the original amino acid is > 0.7, i.e., the probability of any mutation is < 0.3
177
+ if originals_logit > 0.7:
178
+ conservation_likelihoods[(original_residue, i)] = 1
179
+ log_update("\t\tConserved position")
180
+ else:
181
+ conservation_likelihoods[(original_residue, i)] = 0
182
+ log_update("\t\tNot conserved position")
183
+
184
+ # return a dictionary of all the things we need for the next part
185
+ return {
186
+ 'start': start,
187
+ 'end': end,
188
+ 'originals_logits': originals_logits,
189
+ 'conservation_likelihoods': conservation_likelihoods,
190
+ 'logits': logits,
191
+ 'filtered_indices': filtered_indices,
192
+ 'top_n_mutations': top_n_mutations,
193
+ 'top_n_advantage_mutations': top_n_advantage_mutations,
194
+ 'top_n_disadvantage_mutations': top_n_disadvantage_mutations,
195
+ 'logits_for_each_AA': logits_for_each_AA,
196
+ 'llrs_for_each_AA': llrs_for_each_AA
197
+ }
198
+
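The conservation call above reduces to a simple threshold on the softmax probability of the original residue under the mask. A minimal standard-library sketch with toy logits (no model involved; the helper names are illustrative):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of raw logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_conserved(prob_original, threshold=0.7):
    # mirrors the rule above: 1 if the original residue's probability
    # exceeds 0.7 (mutation probability < 0.3), else 0
    return 1 if prob_original > threshold else 0

# toy logits for a 4-token vocabulary; token 0 is the original residue
probs = softmax([3.0, 0.5, 0.2, 0.1])
flag = is_conserved(probs[0])
```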
199
+ def find_top_3(d):
200
+ temp = pd.DataFrame.from_dict(d, orient='index').reset_index()
201
+ temp = temp.sort_values(by=0,ascending=False).reset_index(drop=True)
202
+ temp = temp.iloc[0:3,:]
203
+ return_d = dict(zip(temp['index'],temp[0]))
204
+ return return_d
205
+
206
+ def make_full_results_df(mutation_results, tokenizer, original_sequence):
207
+ # Unpack mutation results
208
+ logits = mutation_results['logits']
209
+ logits_for_each_AA = mutation_results['logits_for_each_AA']
210
+ filtered_indices = mutation_results['filtered_indices']
211
+ llrs_for_each_AA = mutation_results['llrs_for_each_AA']
212
+
213
+ token_indices = torch.arange(logits.size(-1))
214
+ tokens = [tokenizer.decode([idx]) for idx in token_indices]
215
+ filtered_tokens = [tokens[i] for i in filtered_indices]
216
+ all_logits_array = np.vstack(logits_for_each_AA)
217
+ normalized_logits_array = F.softmax(torch.tensor(all_logits_array), dim=-1).numpy()
218
+
219
+ df = pd.DataFrame(normalized_logits_array) # rows are positions, columns are the 20 amino acids
221
+ df.columns = filtered_tokens
222
+ df.index = list(range(1, len(df)+1))
223
+ df['all_logits'] = df[filtered_tokens].to_dict(orient='index')
224
+ df['top_3_mutations'] = df['all_logits'].apply(lambda x: find_top_3(x))
225
+ df['original_residue'] = list(original_sequence)
226
+ df['original_residue_logit'] = df.apply(lambda row: row['all_logits'][row['original_residue']],axis=1)
227
+ df = df[['original_residue','original_residue_logit','all_logits','top_3_mutations']]
228
+ df = df.reset_index().rename(columns={'index':'Residue'})
229
+ return df
230
+
231
+ def make_small_results_df(mutation_results):
232
+ conservation_likelihoods = mutation_results['conservation_likelihoods']
233
+ top_n_mutations = mutation_results['top_n_mutations']
234
+
235
+ # store the predicted mutations in a dataframe
236
+ original_residues = []
237
+ mutations = []
238
+ positions = []
239
+ conserved = []
240
+
241
+ for key, value in top_n_mutations.items():
242
+ original_residue, position = key
243
+ original_residues.append(original_residue)
244
+ mutations.append(','.join(value))
245
+ positions.append(position + 1)
246
+
247
+ for i, (key, value) in enumerate(conservation_likelihoods.items()):
248
+ original_residue, position = key
249
+ if original_residues[i]==original_residue: # it should, otherwise something is wrong
250
+ conserved.append(value)
251
+
252
+ df = pd.DataFrame({
253
+ 'Original Residue': original_residues,
254
+ 'Predicted Residues': mutations,
255
+ 'Conserved': conserved,
256
+ 'Position': positions
257
+ })
258
+ return df
259
+
260
+
261
+ def main():
262
+ # Make results directory
263
+ os.makedirs('results',exist_ok=True)
264
+ output_dir = f'results/{get_local_time()}'
265
+ os.makedirs(output_dir,exist_ok=True)
266
+
267
+ # Predict mutations, writing results to a log inside of the output directory
268
+ with open_logfile(f"{output_dir}/mutation_discovery_log.txt"):
269
+ print_configpy(config)
270
+ # Make sure environment variables are set correctly
271
+ check_env_variables()
272
+
273
+ # Get device
274
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
275
+ log_update(f"Using device: {device}")
276
+
277
+ # Load fuson as AutoModelForMaskedLM
278
+ fuson_ckpt_path = config.FUSON_PLM_CKPT
279
+ if fuson_ckpt_path=="FusOn-pLM":
280
+ fuson_ckpt_path="../../../.."
281
+ model_name = "fuson_plm"
282
+ model_epoch = "best"
283
+ model_str = f"fuson_plm/best"
284
+ else:
285
+ model_name = list(fuson_ckpt_path.keys())[0]
286
+ epoch = list(fuson_ckpt_path.values())[0]
287
+ fuson_ckpt_path = f'../../training/checkpoints/{model_name}/checkpoint_epoch_{epoch}'
288
+ model_name, model_epoch = fuson_ckpt_path.split('/')[-2::]
289
+ model_epoch = model_epoch.split('checkpoint_')[-1]
290
+ model_str = f"{model_name}/{model_epoch}"
291
+
292
+ log_update(f"\nLoading FusOn-pLM model from {fuson_ckpt_path}")
293
+ fuson_tokenizer = AutoTokenizer.from_pretrained(fuson_ckpt_path)
294
+ fuson_model = AutoModelForMaskedLM.from_pretrained(fuson_ckpt_path)
295
+ fuson_model.to(device)
296
+ fuson_model.eval()
297
+
298
+ if (config.PATH_TO_INPUT_FILE is not None) and os.path.exists(config.PATH_TO_INPUT_FILE):
299
+ input_file = pd.read_csv(config.PATH_TO_INPUT_FILE)
300
+ else:
301
+ input_file = pd.DataFrame(
302
+ data={
303
+ 'fusion_name': [config.FUSION_NAME],
304
+ 'full_fusion_sequence': [config.FULL_FUSION_SEQUENCE],
305
+ 'start_residue_index': [config.START_RESIDUE_INDEX],
306
+ 'end_residue_index': [config.END_RESIDUE_INDEX],
307
+ 'n': [config.N]
308
+ }
309
+ )
310
+
311
+ log_update(f"\nThere are {len(input_file)} total sequences on which to perform mutation discovery. Fusion Genes:")
312
+ log_update("\t" + "\n\t".join(input_file['fusion_name']))
313
+ # Loop through each input and make a subfolder with its data
314
+ for i in range(len(input_file)):
315
+ row = input_file.loc[i,:]
316
+ fusion_name = row['fusion_name']
317
+ full_fusion_sequence = row['full_fusion_sequence']
318
+ start_residue_index = row['start_residue_index']
319
+ end_residue_index = row['end_residue_index']
320
+ n = row['n']
321
+ sub_output_dir = f"{output_dir}/{fusion_name}"
322
+ os.makedirs(sub_output_dir,exist_ok=True)
323
+
324
+ # Predict positionwise mutations, plot the results
325
+ domain_bounds = {'start': start_residue_index, 'end': end_residue_index}
326
+ mutation_results = predict_positionwise_mutations(full_fusion_sequence, domain_bounds, n,
327
+ fuson_model, fuson_tokenizer, device)
328
+
329
+ # Save mutation results
330
+ with open(f"{sub_output_dir}/raw_mutation_results.pkl", "wb") as f:
331
+ pickle.dump(mutation_results, f)
332
+
333
+ # Plot the heatmaps
334
+ plot_full_heatmap(mutation_results, fuson_tokenizer,
335
+ fusion_name=fusion_name, save_path=f"{sub_output_dir}/full_heatmap.png")
336
+ plot_conservation_heatmap(mutation_results,
337
+ fusion_name=fusion_name, save_path=f"{sub_output_dir}/conservation_heatmap.png")
338
+
339
+ # Make results dataframe
340
+ small_mutation_results_df = make_small_results_df(mutation_results)
341
+ small_mutation_results_df.to_csv(f"{sub_output_dir}/predicted_tokens.csv",index=False)
342
+ full_mutation_results_df = make_full_results_df(mutation_results, fuson_tokenizer, full_fusion_sequence)
343
+ full_mutation_results_df.to_csv(f"{sub_output_dir}/full_results_with_logits.csv",index=False)
344
+
345
+ if __name__ == "__main__":
346
+ main()
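The per-position LLR computed in `predict_positionwise_mutations` is the log-softmax of the masked-position logits minus the log-probability of the original residue. A minimal dependency-free sketch with toy logits (the function names here are illustrative):

```python
import math

def log_softmax(logits):
    # numerically stable log-softmax over a list of raw logits
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def llrs_from_logits(logits, original_index):
    # LLR[j] = log P(token j) - log P(original residue under the mask)
    logp = log_softmax(logits)
    ref = logp[original_index]
    return [lp - ref for lp in logp]

# toy logits over a 4-token vocabulary; the original residue is token 0
llrs = llrs_from_logits([2.0, 1.0, 0.5, -1.0], 0)
```

Because the normalizing constant cancels, each LLR is just a logit difference; positive LLRs mark substitutions the model favors over the original residue and negative LLRs mark disfavored ones, matching how `top_n_advantage_mutations` and `top_n_disadvantage_mutations` are ranked.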
fuson_plm/benchmarking/mutation_prediction/discovery/make_color_bar.py ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import matplotlib.pyplot as plt
2
+ import numpy as np
3
+
4
+ def plot_color_bar():
5
+ """
6
+ Create a Viridis color bar ranging from 0 to 1.
7
+ """
8
+ # Create a gradient from 0 to 1
9
+ gradient = np.linspace(0, 1, 256).reshape(1, -1)
10
+
11
+ # Plot the gradient as a color bar
12
+ fig, ax = plt.subplots(figsize=(12, 3))
13
+ ax.imshow(gradient, aspect="auto", cmap="viridis")
14
+ ax.set_xticks([0, 255])
15
+ ax.set_xticklabels(["0\nmost likely\nto mutate", "1\nleast likely\nto mutate"], fontsize=40)
16
+ ax.set_yticks([])
17
+ ax.set_title("Original Residue Logits", fontsize=40)
18
+
19
+ # Save the figure before displaying it (plt.show() can clear the figure in some backends)
20
+ plt.tight_layout()
21
+ plt.savefig("viridis_color_bar.png", dpi=300)
22
+ plt.show()
23
+
24
+ # Call the function to create and display the color bar
25
+ plot_color_bar()
fuson_plm/benchmarking/mutation_prediction/discovery/plot.py ADDED
@@ -0,0 +1,167 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import matplotlib.pyplot as plt
3
+ import seaborn as sns
4
+ import torch.nn.functional as F
5
+ import numpy as np
6
+ import os
7
+ import pandas as pd
8
+ import pickle
9
+ from transformers import AutoTokenizer
10
+ from fuson_plm.utils.visualizing import set_font
11
+ import fuson_plm.benchmarking.mutation_prediction.discovery.config as config
12
+
13
+ def get_x_tick_labels(start, end):
14
+ # Define start and end index which we actually use to index the sequence
15
+ start_index = start - 1
16
+ end_index = end
17
+
18
+ # Define domain length
19
+ domain_len = end - start
20
+ if 500 > domain_len > 100:
21
+ step_size = 50
22
+ elif 500 <= domain_len:
23
+ step_size = 100
24
+ elif domain_len < 10:
25
+ step_size = 1
26
+ else:
27
+ step_size = 10
28
+
29
+ # Define x tick positions based on step size
30
+ x_tick_positions = np.arange(start_index, end_index, step_size)
31
+ x_tick_labels = [str(pos + 1) for pos in x_tick_positions]
32
+
33
+ return x_tick_positions, x_tick_labels
34
+
35
+
36
+ def plot_conservation_heatmap(mutation_results, fusion_name="Fusion Oncoprotein", save_path="conservation_heatmap.png"):
37
+ start = mutation_results['start']
38
+ end = mutation_results['end']
39
+ originals_logits = mutation_results['originals_logits']
40
+ conservation_likelihoods = mutation_results['conservation_likelihoods']
41
+ logits = mutation_results['logits']
42
+ logits_for_each_AA = mutation_results['logits_for_each_AA']
43
+ filtered_indices = mutation_results['filtered_indices']
44
+ top_n_mutations = mutation_results['top_n_mutations']
45
+
46
+ # Get start index and end index
47
+ start_index = start - 1
48
+ end_index = end
49
+
50
+ # Make conservation likelihoods array for plotting
51
+ all_logits_array = np.vstack(originals_logits)
52
+ transposed_logits_array = all_logits_array.T
53
+ conservation_likelihoods_array = np.array(list(conservation_likelihoods.values())).reshape(1, -1)
54
+ # combine to make a 2D heatmap
55
+ combined_array = np.vstack((transposed_logits_array, conservation_likelihoods_array))
56
+
57
+ # Get ticks
58
+ x_tick_positions, x_tick_labels = get_x_tick_labels(start, end)
59
+
60
+ # Plot!
61
+ set_font()
62
+ # Adjust the figure size: constant height (e.g., 3) and width proportional to sequence length
63
+ sequence_length = end_index - start_index
64
+ fig = plt.figure(figsize=(min(15, sequence_length / 10), 3)) # Adjust width dynamically, keep height constant
65
+
66
+ #plt.rcParams.update({'font.size': 16.5}) # make font size bigger
67
+ ax = sns.heatmap(
68
+ combined_array,
69
+ cmap='viridis',
70
+ xticklabels=x_tick_labels,
71
+ yticklabels=['Original Logits', 'Conserved'],
72
+ cbar=True,
73
+ cbar_kws={'aspect': 2,
74
+ 'pad': 0.02,
75
+ 'shrink': 1.0, # Adjust the overall size of the color bar
76
+ }
77
+ )
78
+ # Access the color bar
79
+ cbar = ax.collections[0].colorbar
80
+
81
+ # Change the font size of the tick labels on the color bar
82
+ cbar.ax.tick_params(labelsize=20) # Adjust the font size of tick labels
83
+
84
+ plt.xticks(x_tick_positions - start_index + 0.5, x_tick_labels, rotation=90, fontsize=20)
85
+ plt.yticks(fontsize=20, rotation=0)
86
+ plt.title(f'{fusion_name} Residues {start}-{end}', fontsize=30)
87
+ plt.xlabel('Residue Index', fontsize=30)
88
+ plt.tight_layout()
89
+ # save the figure before displaying it (plt.show() can clear the figure in some backends)
90
+ plt.savefig(save_path, format='png', dpi=300)
91
+
92
+ plt.show()
93
+
94
+ # plotting heatmap 1
95
+ def plot_full_heatmap(mutation_results, tokenizer, fusion_name="Fusion Oncoprotein", save_path="full_heatmap.png"):
96
+ start = mutation_results['start']
97
+ end = mutation_results['end']
98
+ logits = mutation_results['logits']
99
+ logits_for_each_AA = mutation_results['logits_for_each_AA']
100
+ filtered_indices = mutation_results['filtered_indices']
101
+
102
+ # get start and end index
103
+ start_index = start - 1
104
+ end_index = end
105
+
106
+ # prepare data for plotting
107
+ token_indices = torch.arange(logits.size(-1))
108
+ tokens = [tokenizer.decode([idx]) for idx in token_indices]
109
+ filtered_tokens = [tokens[i] for i in filtered_indices]
110
+ all_logits_array = np.vstack(logits_for_each_AA)
111
+ normalized_logits_array = F.softmax(torch.tensor(all_logits_array), dim=-1).numpy()
112
+ transposed_logits_array = normalized_logits_array.T
113
+
114
+ # get x tick labels
115
+ x_tick_positions, x_tick_labels = get_x_tick_labels(start, end)
116
+
117
+ # make plot
118
+ set_font()
119
+ fig = plt.figure(figsize=(15, 8))
120
+ plt.rcParams.update({'font.size': 16.5})
121
+ sns.heatmap(transposed_logits_array, cmap='plasma', xticklabels=x_tick_labels, yticklabels=filtered_tokens)
122
+ plt.title(f'{fusion_name} Residues {start}-{end}: Token Probability')
123
+ plt.ylabel('Amino Acid')
124
+ plt.xlabel('Residue Index')
125
+ plt.yticks(rotation=0)
126
+ plt.xticks(x_tick_positions - start_index + 0.5, x_tick_labels, rotation=0)
127
+ plt.tight_layout()
128
+ plt.savefig(save_path, format='png', dpi = 300)
129
+
130
+ def plot_color_bar():
131
+ """
132
+ Create a Viridis color bar ranging from 0 to 1.
133
+ """
134
+ # Create a gradient from 0 to 1
135
+ gradient = np.linspace(0, 1, 256).reshape(1, -1)
136
+
137
+ # Plot the gradient as a color bar
138
+ fig, ax = plt.subplots(figsize=(12, 3))
139
+ ax.imshow(gradient, aspect="auto", cmap="viridis")
140
+ ax.set_xticks([0, 255])
141
+ ax.set_xticklabels(["0\nmost likely\nto mutate", "1\nleast likely\nto mutate"], fontsize=40)
142
+ ax.set_yticks([])
143
+ ax.set_title("Original Residue Logits", fontsize=40)
144
+
145
+ # Save the figure before displaying it (plt.show() can clear the figure in some backends)
146
+ plt.tight_layout()
147
+ plt.savefig("viridis_color_bar.png", dpi=300)
148
+ plt.show()
149
+
150
+ def main():
151
+ # Call the function to create and display the color bar
152
+ plot_color_bar()
153
+
154
+ results_dir = "results/final"
155
+ subfolders = os.listdir(results_dir)
156
+ for subfolder in subfolders:
157
+ full_path = f"{results_dir}/{subfolder}"
158
+ if os.path.isdir(full_path):
159
+ with open(f"{full_path}/raw_mutation_results.pkl", "rb") as f:
160
+ mutation_results = pickle.load(f)
161
+ plot_conservation_heatmap(mutation_results,
162
+ fusion_name=subfolder, save_path=f"{full_path}/conservation_heatmap.png")
163
+
164
+
165
+
166
+ if __name__ == "__main__":
167
+ main()
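The step-size rule in `get_x_tick_labels` can be checked in isolation. A standard-library sketch mirroring that logic (the function name here is illustrative, and `numpy.arange` is replaced by `range` for a dependency-free example):

```python
def tick_positions_and_labels(start, end):
    # Mirrors get_x_tick_labels: tick every residue for domains under 10,
    # every 10 up to 100, every 50 up to 500, and every 100 beyond that.
    domain_len = end - start
    if 500 > domain_len > 100:
        step_size = 50
    elif domain_len >= 500:
        step_size = 100
    elif domain_len < 10:
        step_size = 1
    else:
        step_size = 10
    # positions are 0-indexed into the sequence; labels are 1-indexed residues
    positions = list(range(start - 1, end, step_size))
    labels = [str(pos + 1) for pos in positions]
    return positions, labels

positions, labels = tick_positions_and_labels(1, 25)  # 24-residue domain -> step 10
```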
fuson_plm/benchmarking/mutation_prediction/discovery/processed_data/521_logit_bfactor.cif ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin → fuson_plm/benchmarking/mutation_prediction/discovery/processed_data/domain_conservation_fusions_inputfile.csv RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3a0b878f4a6dfdeb5bc3acb5ceb82980c7840c4afe5bc1063c23e20e9f8da623
3
- size 2609617594
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a7bef22b6753a20845f095872431e1a222ac2be40ffa20bfe36113d80a7b57b2
3
+ size 3119
fuson_plm/benchmarking/mutation_prediction/discovery/processed_data/test_seqs_tftf_kk.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4e77c7777c0163df7b4ea740d16c5aff0a02cec03a251130dde0cf9b291d4fa2
3
+ size 4023866
fuson_plm/benchmarking/mutation_prediction/discovery/raw_data/salokas_2020_tableS3.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d8bebc0871a4329015a3c6c7843f5bbc86c48811b2a836c42f1ef46b37f4282a
3
+ size 19626
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/ETV6::NTRK3/conservation_heatmap.png ADDED
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/ETV6::NTRK3/full_results_with_logits.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1e881d4a4c081da84a60991e2ad9598fcf1dba47cbcb92b5cb6d70c29f4217ea
3
+ size 432231
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/ETV6::NTRK3/predicted_tokens.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3349b52a936dbce7810faf3e17a296cf63701fb87e64d51de2d880575778b1b0
3
+ size 10299
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/ETV6::NTRK3/raw_mutation_results.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9762879aacee6bd42440326c3d7b68871df8caacf97127db213c47eec9fa7d1f
3
+ size 315798
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/EWSR1::FLI1/conservation_heatmap.png ADDED
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/EWSR1::FLI1/full_results_with_logits.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:82cddf491a55bf2254ab86b2a20baf161097b5538a5868c0660c88baf856f672
3
+ size 349203
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/EWSR1::FLI1/predicted_tokens.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ebd172fd6717511efd5b4971e0370381f1a29d29b12dbc196b82585ed2ef5065
3
+ size 8363
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/EWSR1::FLI1/raw_mutation_results.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4973a6fc35ee9647f02f83627181b83e5f861c5918dc40abf4abb872d3e7088b
3
+ size 256739
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/PAX3::FOXO1/conservation_heatmap.png ADDED
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/PAX3::FOXO1/full_results_with_logits.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6d7841257792c6d83d2cfcd5456bf8761b3d3fc0dd69cd912ecfc05a80ad3dec
3
+ size 545639
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/PAX3::FOXO1/predicted_tokens.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e8d94cfad62e2b62bf60d62cfef02fa2328f5d37c7f1c141575d098896bc2103
3
+ size 13323
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/PAX3::FOXO1/raw_mutation_results.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0dab0b3ee7ea7f9c3c67fdef16bf81dd48654bb8bfdf12eedd58e1972d908a24
3
+ size 408037
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/TRIM24::RET/conservation_heatmap.png ADDED
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/TRIM24::RET/full_results_with_logits.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9ad3fc01c9ce3c67330d8fc79d14a551aa4d7f8c8d596a49e8e8042d6a8fbb1b
+ size 635205
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/TRIM24::RET/predicted_tokens.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:22cdcd8cfbd5f70f1067d2104e6f0e3fc5e4e5e3fddd736934ce905b35cd0cea
+ size 15195
fuson_plm/benchmarking/mutation_prediction/discovery/results/final/TRIM24::RET/raw_mutation_results.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1d719d23b4c7e6deb79ac9ef3c5c84a1c0c3a2e4ae9926f9a460731bd78f9e0c
+ size 465133
fuson_plm/benchmarking/mutation_prediction/discovery/viridis_color_bar.png ADDED
fuson_plm/benchmarking/mutation_prediction/recovery/abl_mutations.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:22679a1f55d49ed4a8d5309c27215a4a204f47b49ce0a7a266a674208e381e85
+ size 307
fuson_plm/benchmarking/mutation_prediction/recovery/alk_mutations.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9339f8cd5ea5369a9b87ca3b420ee2731c4e057d5c546c3c2a09c727154721df
+ size 192
fuson_plm/benchmarking/mutation_prediction/recovery/color_recovered_mutations_public.ipynb ADDED
@@ -0,0 +1,314 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FJd6a9gdZNjG"
+ },
+ "source": [
+ "### Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "## Put path to model predictions you'd like to use for benchmarking\n",
+ "path_to_bcr_abl_preds = \"results/11-21-2024-16:03:59/Supplementary Tables - BCR ABL Mutations.csv\"\n",
+ "path_to_eml4_alk_preds = \"results/11-21-2024-16:03:59/Supplementary Tables - EML4 ALK Mutations.csv\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install torch pandas numpy py3Dmol scikit-learn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "id": "ZEWZVc9lUxjI"
+ },
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "import torch.nn as nn\n",
+ "\n",
+ "import pickle\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "import py3Dmol\n",
+ "\n",
+ "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, average_precision_score"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bcr_abl_preds = pd.read_csv(path_to_bcr_abl_preds)\n",
+ "eml4_alk_preds = pd.read_csv(path_to_eml4_alk_preds)\n",
+ "bcr_abl_seq = ''.join(bcr_abl_preds['Original Residue'].tolist())\n",
+ "eml4_alk_seq = ''.join(eml4_alk_preds['Original Residue'].tolist())\n",
+ "\n",
+ "print(\"BCR::ABL seq: \", bcr_abl_seq)\n",
+ "print(\"EML4::ALK seq: \", eml4_alk_seq)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bcr_abl_preds = bcr_abl_preds[bcr_abl_preds['Literature Mutation'].notna()].reset_index(drop=True)\n",
+ "eml4_alk_preds = eml4_alk_preds[eml4_alk_preds['Literature Mutation'].notna()].reset_index(drop=True)\n",
+ "\n",
+ "bcr_abl_preds"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Calculate hits\n",
+ "print(\"BCR::ABL\", len(bcr_abl_preds), sum(bcr_abl_preds['Hit']))\n",
+ "print(\"EML4::ALK\", len(eml4_alk_preds), sum(eml4_alk_preds['Hit']))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "os.getcwd()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Color EML4 ALK structure\n",
+ "path_to_eml4_alk_structure = np.nan # not publicly available\n",
+ "eml4_alk_seq = np.nan # not publicly available\n",
+ "alk_kinase_domain_seq = np.nan # not publicly available\n",
+ "alk_kinase_domain_resis = [eml4_alk_seq.index(alk_kinase_domain_seq)+1, eml4_alk_seq.index(alk_kinase_domain_seq)+len(alk_kinase_domain_seq)]\n",
+ "print(alk_kinase_domain_resis)\n",
+ "print(len(eml4_alk_seq))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "l = list(range(567,843+1))\n",
+ "l = [str(x) for x in l]\n",
+ "print('+'.join(l))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "eml4_alk_mutation_resis = eml4_alk_preds['Position'].tolist()\n",
+ "eml4_alk_hit_resis = eml4_alk_preds.loc[\n",
+ "    eml4_alk_preds['Hit']\n",
+ "]['Position'].tolist()\n",
+ "kinase_domain_coloring = {\n",
+ "    'kinase': [alk_kinase_domain_resis[0], alk_kinase_domain_resis[1], '#6ea4da']\n",
+ "}\n",
+ "\n",
+ "missed_mut_resis = [x for x in eml4_alk_mutation_resis if x not in eml4_alk_hit_resis]\n",
+ "print('missed', '+'.join([str(x) for x in missed_mut_resis]))\n",
+ "\n",
+ "print('hit', '+'.join([str(x) for x in eml4_alk_hit_resis]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# PYMOL code for EML4 ALK\n",
+ "# Set global cartoon transparency\n",
+ "# 0.5 for close up\n",
+ "color gray60\n",
+ "set cartoon_transparency, 0.6\n",
+ "\n",
+ "# Define a custom color\n",
+ "set_color custom_blue, [0x6e/255, 0xa4/255, 0xda/255]\n",
+ "\n",
+ "# Select and color residues 567-843\n",
+ "sele alk, resi 567-843\n",
+ "color custom_blue, alk\n",
+ "\n",
+ "# Select missed mutation residues, color them orange, and make them fully opaque\n",
+ "sele missed_mut_resis, resi 603+707\n",
+ "color orange, missed_mut_resis\n",
+ "set cartoon_transparency, 0, missed_mut_resis\n",
+ "\n",
+ "# Select hit mutation residues, color them magenta, and make them fully opaque\n",
+ "sele hit_mut_resis, resi 607+622+625+631+647+649+653+654+657+661+696+720\n",
+ "color magenta, hit_mut_resis\n",
+ "set cartoon_transparency, 0, hit_mut_resis\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def color_mutation_recovery(pdb_file, mutation_resis, hit_resis, other_domain_coloring=None):\n",
+ "    # Create viewer\n",
+ "    viewer = py3Dmol.view()\n",
+ "\n",
+ "    # Load CIF file\n",
+ "    with open(pdb_file, 'r') as f:\n",
+ "        pdb_file = f.read()\n",
+ "\n",
+ "    # Add structure\n",
+ "    viewer.addModel(pdb_file, 'pdb')\n",
+ "\n",
+ "    viewer.setStyle({})\n",
+ "    viewer.setStyle({'cartoon': {'color': 'lightgrey','transparency': 0.7}})\n",
+ "\n",
+ "    # Apply colors based on hit/miss status of each mutated residue\n",
+ "    for i, value in enumerate(mutation_resis):\n",
+ "        style = {'cartoon': {'transparency': 1}}\n",
+ "        if value in hit_resis:\n",
+ "            style['cartoon']['color'] = 'yellow'\n",
+ "        else:\n",
+ "            style['cartoon']['color'] = 'red'\n",
+ "        # Apply combined style directly, avoiding any residual layering\n",
+ "        viewer.setStyle({'resi': value}, style)\n",
+ "\n",
+ "    # If you want to color specific domains by default, do it here\n",
+ "    if other_domain_coloring is not None:\n",
+ "        for name, (start, end, color) in other_domain_coloring.items():\n",
+ "            resis = list(range(start, end+1))\n",
+ "            resis = [x for x in resis if x not in mutation_resis]\n",
+ "            viewer.setStyle({'resi': resis}, {'cartoon': {'color': color,'transparency': 0.7}})\n",
+ "\n",
+ "    # Show viewer\n",
+ "    viewer.zoomTo()\n",
+ "    return viewer.show()\n",
+ "\n",
+ "color_mutation_recovery(path_to_eml4_alk_structure, eml4_alk_mutation_resis, eml4_alk_hit_resis,\n",
+ "                        other_domain_coloring=kinase_domain_coloring)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Color BCR::ABL structure\n",
+ "path_to_bcr_abl_structure = np.nan # not publicly available\n",
+ "bcr_abl_seq = np.nan # not publicly available\n",
+ "abl_kinase_domain_seq = np.nan # not publicly available\n",
+ "abl_kinase_domain_resis = [bcr_abl_seq.index(abl_kinase_domain_seq)+1, bcr_abl_seq.index(abl_kinase_domain_seq)+len(abl_kinase_domain_seq)]\n",
+ "print(abl_kinase_domain_resis)\n",
+ "print(abl_kinase_domain_seq)\n",
+ "print(len(bcr_abl_seq))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bcr_abl_mutation_resis = bcr_abl_preds['Position'].tolist()\n",
+ "bcr_abl_hit_resis = bcr_abl_preds.loc[\n",
+ "    bcr_abl_preds['Hit']\n",
+ "]['Position'].tolist()\n",
+ "kinase_domain_coloring = {\n",
+ "    'kinase': [abl_kinase_domain_resis[0], abl_kinase_domain_resis[1], '#6ea4da']\n",
+ "}\n",
+ "\n",
+ "bcr_abl_missed_mut_resis = [x for x in bcr_abl_mutation_resis if x not in bcr_abl_hit_resis]\n",
+ "print('missed residues', len(bcr_abl_missed_mut_resis), '+'.join([str(x) for x in bcr_abl_missed_mut_resis]))\n",
+ "print('hit residues', len(bcr_abl_hit_resis), '+'.join([str(x) for x in bcr_abl_hit_resis]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# PYMOL code for BCR ABL\n",
+ "# Set global cartoon transparency\n",
+ "# 0 for zoomed out, 0.5 for close up\n",
+ "color gray60\n",
+ "set cartoon_transparency, 0.6\n",
+ "\n",
+ "# Define a custom color\n",
+ "set_color custom_blue, [0x6e/255, 0xa4/255, 0xda/255]\n",
+ "\n",
+ "# Select and color residues 1085-1336\n",
+ "sele abl, resi 1085-1336\n",
+ "color custom_blue, abl\n",
+ "\n",
+ "# Select missed mutation residues, color them orange, and make them fully opaque\n",
+ "sele missed_mut_resis, resi 1087+1090+1093+1095+1116+1125+1128+1135+1140+1158+1192+1194+1218+1249+1273\n",
+ "color orange, missed_mut_resis\n",
+ "set cartoon_transparency, 0, missed_mut_resis\n",
+ "\n",
+ "# Select hit mutation residues, color them magenta, and make them fully opaque\n",
+ "sele hit_mut_resis, resi 1091+1096+1098+1132+1142+1154+1160+1198+1202+1222+1227+1230+1239\n",
+ "color magenta, hit_mut_resis\n",
+ "set cartoon_transparency, 0, hit_mut_resis"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "collapsed_sections": [
+ "FJd6a9gdZNjG",
+ "zORkLJztZWp9",
+ "w25hagtZaV65",
+ "IbyqxlvAFUAK",
+ "0n5PSprbhLk7"
+ ],
+ "machine_shape": "hm",
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+ }
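The notebook cells above repeatedly build PyMOL `resi` selections by joining residue numbers with `+` (e.g. `'+'.join([str(x) for x in missed_mut_resis])`). A minimal sketch of that pattern as a reusable helper; the function name `pymol_resi_selection` is ours, for illustration:

```python
def pymol_resi_selection(resis):
    """Join residue numbers into a PyMOL `resi` selection string,
    e.g. [603, 707] -> '603+707', usable as `sele missed, resi 603+707`."""
    return '+'.join(str(x) for x in resis)

# Split literature-mutation positions into hit vs. missed, as the notebook does
mutation_resis = [603, 607, 707, 720]
hit_resis = [607, 720]
missed = [x for x in mutation_resis if x not in hit_resis]

print(pymol_resi_selection(missed))      # 603+707
print(pymol_resi_selection(hit_resis))   # 607+720
```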
fuson_plm/benchmarking/mutation_prediction/recovery/config.py ADDED
@@ -0,0 +1,3 @@
+ FUSON_PLM_CKPT = "FusOn-pLM" # Dictionary: key = run name, values = epochs, or string "FusOn-pLM"
+
+ CUDA_VISIBLE_DEVICES = "0"
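As the comment in `config.py` notes, `FUSON_PLM_CKPT` is either the string `"FusOn-pLM"` (use the released model) or a `{run_name: epoch}` dictionary pointing at a training checkpoint. A condensed sketch of how `recover_public.py` resolves this setting into a checkpoint path and a `model_str` label; the helper name `resolve_ckpt` is ours:

```python
def resolve_ckpt(fuson_plm_ckpt):
    """Resolve the FUSON_PLM_CKPT config value to (checkpoint_path, model_str)."""
    if fuson_plm_ckpt == "FusOn-pLM":
        # Released model lives at the repository root
        return "../../../..", "fuson_plm/best"
    # Otherwise a {run_name: epoch} dictionary selects a training checkpoint
    model_name = list(fuson_plm_ckpt.keys())[0]
    epoch = list(fuson_plm_ckpt.values())[0]
    path = f"../../training/checkpoints/{model_name}/checkpoint_epoch_{epoch}"
    return path, f"{model_name}/epoch_{epoch}"

print(resolve_ckpt("FusOn-pLM"))    # ('../../../..', 'fuson_plm/best')
print(resolve_ckpt({"my_run": 5}))  # ('../../training/checkpoints/my_run/checkpoint_epoch_5', 'my_run/epoch_5')
```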
fuson_plm/benchmarking/mutation_prediction/recovery/recover_public.py ADDED
@@ -0,0 +1,330 @@
+ #### Recover mutations from literature. A benchmark
+ import fuson_plm.benchmarking.mutation_prediction.recovery.config as config
+ import os
+ os.environ['CUDA_VISIBLE_DEVICES'] = config.CUDA_VISIBLE_DEVICES
+
+ import pandas as pd
+ import numpy as np
+ import transformers
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import logging
+ import torch
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import argparse
+ import torch.nn.functional as F
+
+ from fuson_plm.utils.logging import open_logfile, log_update, get_local_time, print_configpy
+ from fuson_plm.benchmarking.embed import load_fuson_model
+
+ def check_env_variables():
+     log_update("\nChecking on environment variables...")
+     log_update(f"\tCUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES')}")
+     log_update(f"\ttorch.cuda.device_count(): {torch.cuda.device_count()}")
+     for i in range(torch.cuda.device_count()):
+         log_update(f"\t\tDevice {i}: {torch.cuda.get_device_name(i)}")
+
+ def get_top_k_aa_mutations(all_probabilities, sequence, i, top_k_mutations, k=10):
+     """
+     Should only return top AA mutations
+     """
+     all_probs = pd.DataFrame.from_dict(all_probabilities, orient='index').reset_index()
+     all_probs = all_probs.sort_values(by=0, ascending=False).reset_index(drop=True)
+     top_k_mutation = all_probs['index'].tolist()[0:k]
+     top_k_mutation = ",".join(top_k_mutation)
+     top_k_mutations[(sequence[i], i)] = (top_k_mutation, all_probabilities)
+
+     return top_k_mutations
+
+ def get_top_k_mutations(tokenizer, mask_token_logits, all_probabilities, sequence, i, top_k_mutations, k=3):
+     top_k_tokens = torch.topk(mask_token_logits, k, dim=1).indices[0].tolist()
+     top_k_mutation = []
+     for token in top_k_tokens:
+         replaced_text = tokenizer.decode([token])
+         top_k_mutation.append(replaced_text)
+
+     top_k_mutation = ",".join(top_k_mutation)
+     top_k_mutations[(sequence[i], i)] = (top_k_mutation, all_probabilities)
+
+ def predict_positionwise_mutations(model, tokenizer, device, sequence):
+     log_update("\t\tPredicting position-wise mutations...")
+     top_10_mutations = {}
+     decoded_full_sequence = ''
+     mut_count = 0
+
+     # Mask and unmask sequentially
+     for i in range(len(sequence)):
+         log_update(f"\t\t\t- pos {i+1}/{len(sequence)}")
+         all_probabilities = {} # stored probabilities of each AA at this position
+
+         # Mask JUST the current position
+         masked_seq = sequence[:i] + '<mask>' + sequence[i+1:]
+         inputs = tokenizer(masked_seq, return_tensors="pt", padding=True, truncation=True, max_length=2000)
+         inputs = {k: v.to(device) for k, v in inputs.items()}
+
+         # Forward pass
+         with torch.no_grad():
+             logits = model(**inputs).logits
+
+         # Find logits at masked positions (should just be 1!)
+         mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
+         mask_token_logits = logits[0, mask_token_index, :]
+         mask_token_probs = F.softmax(mask_token_logits, dim=-1)
+
+         # Collect probabilities for natural AAs (token IDs 4-23 inclusive)
+         for token_idx in range(4, 23 + 1):
+             token = mask_token_probs[0, token_idx]
+             replaced_text = tokenizer.decode([token_idx])
+             all_probabilities[replaced_text] = token.item()
+
+         # Isolate top n mutations
+         #get_top_k_mutations(tokenizer, mask_token_logits, all_probabilities, sequence, i, top_10_mutations, k=10)
+         get_top_k_aa_mutations(all_probabilities, sequence, i, top_10_mutations, k=10)
+
+         # Building whole decoded sequence with top 1 token
+         top_1_tokens = torch.topk(mask_token_logits, 1, dim=1).indices[0].item()
+         new_residue = tokenizer.decode([top_1_tokens])
+         decoded_full_sequence += new_residue
+
+         # Check how many mutations in total
+         if sequence[i] != new_residue:
+             mut_count += 1
+
+     # Convert results into DataFrame
+     original_residues = []
+     top10_mutations = []
+     positions = []
+     all_logits = []
+
+     for (original_residue, position), (top10, probs) in top_10_mutations.items():
+         original_residues.append(original_residue)
+         top10_mutations.append(top10)
+         positions.append(position+1) # originally this line was "position" but it should be position + 1
+         all_logits.append(probs)
+
+     df = pd.DataFrame({
+         'Original Residue': original_residues,
+         'Position': positions,
+         'Top 10 Mutations': top10_mutations,
+         'All Probabilities': all_logits,
+     })
+     df['Top Mutation'] = df['Top 10 Mutations'].apply(lambda x: x.split(',')[0])
+     df['Top 3 Mutations'] = df['Top 10 Mutations'].apply(lambda x: ','.join(x.split(',')[0:3]))
+     df['Top 4 Mutations'] = df['Top 10 Mutations'].apply(lambda x: ','.join(x.split(',')[0:4]))
+     df['Top 5 Mutations'] = df['Top 10 Mutations'].apply(lambda x: ','.join(x.split(',')[0:5]))
+
+     return df, decoded_full_sequence, mut_count
+
+ def evaluate_literature_mut_performance(predicted_mutations_df, literature_mutations_df, decoded_full_sequence, mut_count, sequence="", focus_region_start=0, focus_region_end=0, offset=0):
+     """
+     Given a dataframe of predicted mutations and literature mutations, see how well the predicted mutations did
+     """
+     log_update("\t\tComparing predicted mutations to literature-provided mutations")
+     return_df = predicted_mutations_df.copy(deep=True)
+     return_df['Literature Mutation'] = [np.nan]*len(return_df)
+     return_df['Top 1 Hit'] = [np.nan]*len(return_df)
+     return_df['Top 3 Hit'] = [np.nan]*len(return_df)
+     return_df['Top 4 Hit'] = [np.nan]*len(return_df)
+     return_df['Top 5 Hit'] = [np.nan]*len(return_df)
+     return_df['Top 10 Hit'] = [np.nan]*len(return_df)
+
+     log_update(f"\tFormula: new position = {focus_region_start} + lit_position - {offset}")
+     # Iterate through the literature mutations rows
+     for i, row in literature_mutations_df.iterrows():
+         lit_position = row['Position']
+         lit_mutations = row['Mutation']
+         original_residue = row['Original Residue']
+         seq_position = focus_region_start + (lit_position - offset) # find position of the sequence
+
+         matching_row = return_df[return_df['Position'] == seq_position]
+         matching_row_index = matching_row.index
+         matching_residue = matching_row.iloc[0]['Original Residue']
+         match = original_residue==matching_residue
+         log_update(f"\tLit pos: {lit_position}, OG residue: {original_residue}, Full sequence pos: {seq_position}, Full sequence residue: {matching_residue}\n\t\tMatch: {match}")
+
+         # Iterate through the matching rows. We are at the right spot if we have the right original residue.
+         if match:
+             top_mutation = matching_row.iloc[0]['Top Mutation'] # get top mutation
+             top_mutation = top_mutation.split(',')
+             print(top_mutation)
+             return_df.loc[matching_row_index, 'Literature Mutation'] = lit_mutations # get desired mutation
+             # If we got any of the mutations reported in the literature, hit!
+             if any(letter in lit_mutations for letter in top_mutation):
+                 return_df.loc[matching_row_index, 'Top 1 Hit'] = True
+             else:
+                 return_df.loc[matching_row_index, 'Top 1 Hit'] = False
+
+             for k in [3,4,5,10]:
+                 top_k_mutations = matching_row.iloc[0][f'Top {k} Mutations'] # get top k mutations
+                 top_k_mutations = top_k_mutations.split(",")
+                 print(top_k_mutations)
+                 return_df.loc[matching_row_index, 'Literature Mutation'] = lit_mutations # get desired mutation
+                 # If we got any of the mutations reported in the literature, hit!
+                 if any(letter in lit_mutations for letter in top_k_mutations):
+                     return_df.loc[matching_row_index, f'Top {k} Hit'] = True
+                 else:
+                     return_df.loc[matching_row_index, f'Top {k} Hit'] = False
+
+     return return_df, (decoded_full_sequence, mut_count, (mut_count/len(sequence)) * 100)
+
+ def evaluate_eml4_alk(model, tokenizer, device, model_str):
+     alk_muts = pd.read_csv("alk_mutations.csv")
+     decoded_full_sequence, mut_count = None, None
+
+     EML4_ALK_SEQ = np.nan ## not publicly available
+     cons_domain_alk = np.nan # not publicly available
+     focus_region_start = EML4_ALK_SEQ.find(cons_domain_alk)
+
+     if os.path.isfile(f"eml4_alk_mutations/{model_str}/mutated_df.csv"):
+         log_update(f"Mutation predictions for {model_str} have already been calculated. Loading from eml4_alk_mutations/{model_str}/mutated_df.csv")
+         mutated_df = pd.read_csv(f"eml4_alk_mutations/{model_str}/mutated_df.csv")
+         mutated_summary = pd.read_csv(f"eml4_alk_mutations/{model_str}/mutated_summary.csv")
+         decoded_full_sequence = mutated_summary['decoded_full_sequence'][0]
+         mut_count = mutated_summary['mut_count'][0]
+     else:
+         mutated_df, decoded_full_sequence, mut_count = predict_positionwise_mutations(model, tokenizer, device, EML4_ALK_SEQ)
+         mutated_summary = pd.DataFrame(data={'decoded_full_sequence':[decoded_full_sequence],'mut_count':[mut_count]})
+         mutated_df.to_csv(f"eml4_alk_mutations/{model_str}/mutated_df.csv",index=False)
+         mutated_summary.to_csv(f"eml4_alk_mutations/{model_str}/mutated_summary.csv",index=False)
+
+     lit_performance_df, (mut_seq, mut_count, mut_pcnt) = evaluate_literature_mut_performance(mutated_df, alk_muts, decoded_full_sequence, mut_count,
+                                                             sequence=EML4_ALK_SEQ,
+                                                             focus_region_start=focus_region_start,
+                                                             focus_region_end = focus_region_start + len(cons_domain_alk),
+                                                             offset=1115 # original: 1116
+                                                             )
+
+     return lit_performance_df, (mut_seq, mut_count, mut_pcnt)
+
+ def evaluate_bcr_abl(model, tokenizer, device, model_str):
+     abl_muts = pd.read_csv("abl_mutations.csv")
+     decoded_full_sequence, mut_count = None, None
+
+     BCR_ABL_SEQ = np.nan ## not publicly available
+     cons_domain_abl = np.nan ## not publicly available
+     focus_region_start = BCR_ABL_SEQ.find(cons_domain_abl)
+
+     if os.path.isfile(f"bcr_abl_mutations/{model_str}/mutated_df.csv"):
+         log_update(f"Mutation predictions for {model_str} have already been calculated. Loading from bcr_abl_mutations/{model_str}/mutated_df.csv")
+         mutated_df = pd.read_csv(f"bcr_abl_mutations/{model_str}/mutated_df.csv")
+         mutated_summary = pd.read_csv(f"bcr_abl_mutations/{model_str}/mutated_summary.csv")
+         decoded_full_sequence = mutated_summary['decoded_full_sequence'][0]
+         mut_count = mutated_summary['mut_count'][0]
+     else:
+         mutated_df, decoded_full_sequence, mut_count = predict_positionwise_mutations(model, tokenizer, device, BCR_ABL_SEQ)
+         mutated_summary = pd.DataFrame(data={'decoded_full_sequence':[decoded_full_sequence],'mut_count':[mut_count]})
+         mutated_df.to_csv(f"bcr_abl_mutations/{model_str}/mutated_df.csv",index=False)
+         mutated_summary.to_csv(f"bcr_abl_mutations/{model_str}/mutated_summary.csv",index=False)
+
+     lit_performance_df, (mut_seq, mut_count, mut_pcnt) = evaluate_literature_mut_performance(mutated_df, abl_muts, decoded_full_sequence, mut_count,
+                                                             sequence=BCR_ABL_SEQ,
+                                                             focus_region_start=focus_region_start,
+                                                             focus_region_end = focus_region_start + len(cons_domain_abl),
+                                                             offset=241 # original: 242
+                                                             )
+
+     return lit_performance_df, (mut_seq, mut_count, mut_pcnt)
+
+ def summarize_individual_performance(performance_df, path_to_lit_df):
+     """
+     performance_df = dataframe with stats on performance
+     path_to_lit_df = original dataframe
+     """
+     # Load original df
+     lit_muts = pd.read_csv(path_to_lit_df)
+
+     # Mutated Sequence,Original Residue,Position,Top 3 Mutations,Literature Mutation,Hit,All Probabilities
+     mut_rows = performance_df.loc[performance_df['Literature Mutation'].notna()].reset_index(drop=True)
+     mut_rows = mut_rows[['Original Residue','Position','Literature Mutation',
+                          'Top Mutation','Top 1 Hit',
+                          'Top 3 Mutations','Top 3 Hit',
+                          'Top 4 Mutations','Top 4 Hit',
+                          'Top 5 Mutations','Top 5 Hit',
+                          'Top 10 Mutations','Top 10 Hit'
+                          ]]
+
+     mut_rows_str = mut_rows.to_string(index=False)
+     mut_rows_str = "\t\t" + mut_rows_str.replace("\n","\n\t\t")
+     log_update(f"\tPerformance on all mutated positions shown below:\n{mut_rows_str}")
+
+     # Summarize: total hits, percentage of hits
+     total_original_muts = len(lit_muts)
+     for k in [1,3,4,5,10]:
+         total_hits = len(mut_rows.loc[mut_rows[f'Top {k} Hit']==True])
+         total_misses = len(mut_rows.loc[mut_rows[f'Top {k} Hit']==False])
+         total_potential_muts = total_hits+total_misses
+         hit_pcnt = round(100*total_hits/total_potential_muts, 2)
+         miss_pcnt = round(100*total_misses/total_potential_muts, 2)
+
+         log_update(f"\tTotal positions tested / total positions mutated in literature: {total_potential_muts}/{total_original_muts}")
+         log_update(f"\n\t\tTop {k} hit performance:\n\t\t\tHits:{total_hits} ({hit_pcnt}%)\n\t\t\tMisses:{total_misses} ({miss_pcnt}%)")
+
+ def main():
+     os.makedirs('results',exist_ok=True)
+     output_dir = f'results/{get_local_time()}'
+     os.makedirs(output_dir,exist_ok=True)
+     os.makedirs("bcr_abl_mutations",exist_ok=True)
+     os.makedirs("eml4_alk_mutations",exist_ok=True)
+     with open_logfile(f"{output_dir}/mutation_discovery_log.txt"):
+         print_configpy(config)
+
+         # Make sure environment variables are set correctly
+         check_env_variables()
+
+         # Get device
+         device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         log_update(f"Using device: {device}")
+
+         # Load fuson
+         fuson_ckpt_path = config.FUSON_PLM_CKPT
+         if fuson_ckpt_path=="FusOn-pLM":
+             fuson_ckpt_path="../../../.."
+             model_name = "fuson_plm"
+             model_epoch = "best"
+             model_str = f"fuson_plm/best"
+         else:
+             model_name = list(fuson_ckpt_path.keys())[0]
+             epoch = list(fuson_ckpt_path.values())[0]
+             fuson_ckpt_path = f'../../training/checkpoints/{model_name}/checkpoint_epoch_{epoch}'
+             model_name, model_epoch = fuson_ckpt_path.split('/')[-2::]
+             model_epoch = model_epoch.split('checkpoint_')[-1]
+             model_str = f"{model_name}/{model_epoch}"
+
+         log_update(f"\nLoading FusOn-pLM model from {fuson_ckpt_path}")
+         fuson_tokenizer = AutoTokenizer.from_pretrained(fuson_ckpt_path)
+         fuson_model = AutoModelForMaskedLM.from_pretrained(fuson_ckpt_path)
+         fuson_model.to(device)
+         fuson_model.eval()
+
+         # Evaluate BCR::ABL performance with FusOn
+         os.makedirs(f"bcr_abl_mutations/{model_name}",exist_ok=True)
+         os.makedirs(f"bcr_abl_mutations/{model_name}/{model_epoch}",exist_ok=True)
+         log_update("\tEvaluating performance on BCR::ABL mutation prediction with FusOn")
+         abl_lit_performance_fuson, (mut_seq, mut_count, mut_pcnt) = evaluate_bcr_abl(fuson_model, fuson_tokenizer, device, model_str)
+         abl_lit_performance_fuson.to_csv(f'{output_dir}/BCR_ABL_mutation_recovery_fuson.csv', index = False)
+         with open(f'{output_dir}/BCR_ABL_mutation_recovery_fuson_summary.txt', 'w') as f:
+             f.write(mut_seq)
+             f.write(f'number of mutations: {mut_count}')
+             f.write(f'percentage of seq mutated: {mut_pcnt}')
+
+         # Evaluate EML4::ALK performance with FusOn
+         os.makedirs(f"eml4_alk_mutations/{model_name}",exist_ok=True)
+         os.makedirs(f"eml4_alk_mutations/{model_name}/{model_epoch}",exist_ok=True)
+         log_update("\tEvaluating performance on EML4::ALK mutation prediction with FusOn")
+         alk_lit_performance_fuson, (mut_seq, mut_count, mut_pcnt) = evaluate_eml4_alk(fuson_model, fuson_tokenizer, device, model_str)
+         alk_lit_performance_fuson.to_csv(f'{output_dir}/EML4_ALK_mutation_recovery_fuson.csv', index = False)
+         with open(f'{output_dir}/EML4_ALK_mutation_recovery_fuson_summary.txt', 'w') as f:
+             f.write(mut_seq)
+             f.write(f'number of mutations: {mut_count}')
+             f.write(f'percentage of seq mutated: {mut_pcnt}')
+
+         ### Summarize
+         log_update("\nSummarizing FusOn-pLM performance on BCR::ABL")
+         summarize_individual_performance(abl_lit_performance_fuson, "abl_mutations.csv")
+         log_update("\nSummarizing FusOn-pLM performance on EML4::ALK")
+         summarize_individual_performance(alk_lit_performance_fuson, "alk_mutations.csv")
+
+ if __name__ == "__main__":
+     main()
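The ranking step in `get_top_k_aa_mutations` above (sort the per-position amino-acid probability dictionary, keep the top k, join with commas) can be exercised in isolation. A minimal, model-free sketch of that logic; the function name `top_k_from_probs` and the example probabilities are ours:

```python
import pandas as pd

def top_k_from_probs(all_probabilities, k=3):
    """Rank amino acids by predicted probability and return the top k as a
    comma-joined string, matching the format produced by get_top_k_aa_mutations."""
    all_probs = pd.DataFrame.from_dict(all_probabilities, orient='index').reset_index()
    all_probs = all_probs.sort_values(by=0, ascending=False).reset_index(drop=True)
    return ",".join(all_probs['index'].tolist()[:k])

# Hypothetical softmax probabilities for one masked position
probs = {'A': 0.05, 'L': 0.60, 'V': 0.25, 'G': 0.10}
print(top_k_from_probs(probs, k=3))  # L,V,G
```

Downstream, the top-1 entry of this string becomes the `Top Mutation` column, and the `Top k Hit` columns check whether any of the top-k residues appear among the literature-reported mutations at that position.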
fuson_plm/benchmarking/mutation_prediction/recovery/results/final_public/BCR_ABL_mutation_recovery_fuson_mutated_pns_only.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:779bac32bf18f7efdf843120274bdcd0ee2dd1a940655b7f1af1bc0d640bda1a
+ size 18267
fuson_plm/benchmarking/mutation_prediction/recovery/results/final_public/EML4_ALK_mutation_recovery_fuson_mutated_pns_only.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2fdc821e3e303aae226567cea331437eed8af96db43ebbd7487b7a48767de51e
+ size 9375
fuson_plm/benchmarking/mutation_prediction/recovery/results/final_public/Supplementary Tables - EML4 ALK Mutations.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bbc2b69ac1e403b060cda54b753c0e8cd9c0bbac1c64cdf38128aaa23081e12d
+ size 709
fuson_plm/benchmarking/mutation_prediction/recovery/results/final_public/Supplementary Tables - BCR ABL Mutations.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a100f40575539ae5009e0ee87aac73b4b7701ee108cacd417585a8e02155e5c1
+ size 1380
fuson_plm/benchmarking/puncta/README.md CHANGED
@@ -1,6 +1,6 @@
  ## Puncta Prediction Benchmark
 
- This folder contains all the data and code needed to perform the **puncta prediction benchmark** (Figure 3).
+ This folder contains all the data and code needed to train FusOn-pLM-Puncta models and perform the **puncta prediction benchmark** (Figure 3).
 
  ### From raw data to train/test splits
  To train the puncta predictors, we processed raw data from FOdb [(Tripathi et al. 2023)](https://doi.org/10.1038/s41467-023-41655-2) Supplementary dataset 4 (`fuson_plm/data/raw_data/FOdb_puncta.csv`) and Supplementary dataset 5 (`fuson_plm/data/raw_data/FODb_SD5.csv`) using the file `clean.py` in the `puncta` directory.