Mutation Prediction benchmark
This folder contains all the data and code needed to perform the mutation prediction benchmark (Figure 5).
Recovery
In Fig. 5C, drug resistance mutations in BCR::ABL and EML4::ALK are recovered by FusOn-pLM. Since the full sequences are not publicly available, certain sections of the code and results files have been removed. All results for drug resistance mutation positions in each sequence are included in mutation_prediction/recovery.
benchmarking/
└── mutation_prediction/
    └── recovery/
        ├── results/final_public/
        │   ├── BCR_ABL_mutation_recovery_fuson_mutated_pns_only.csv
        │   ├── Supplementary Tables - BCR ABL Mutations.csv
        │   ├── EML4_ALK_mutation_recovery_fuson_mutated_pns_only.csv
        │   └── Supplementary Tables - EML4 ALK Mutations.csv
        ├── abl_mutations.csv
        ├── alk_mutations.csv
        ├── color_recovered_mutations_public.ipynb
        ├── recover_public.py
        └── config.py
In the recovery directory:
- abl_mutations.csv: raw data from the literature with BCR::ABL mutations (O'Hare et al. 2007)
- alk_mutations.csv: raw data from the literature with EML4::ALK mutations (Elshatlawy et al. 2023)
- color_recovered_mutations_public.ipynb: notebook used to write the PyMOL code for the visualizations in Fig. 5C
- recover_public.py: Python script that runs the analysis. Because the private sequences have been removed, this script will not run as-is, but it includes all steps taken.
- config.py: small config file specifying which FusOn-pLM checkpoint is used.
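As an illustration of what "writing PyMOL code" from a list of residue positions can look like, here is a minimal sketch. The function name, selection name, color, and residue numbers are hypothetical and are not taken from the notebook; only the PyMOL select, color, and show commands themselves are standard.

```python
def pymol_commands(positions, selection="recovered_sites", color="red"):
    """Return PyMOL commands that select and highlight the given residue numbers."""
    if not positions:
        return []
    # PyMOL accepts a '+'-separated list of residue numbers after `resi`.
    resi = "+".join(str(p) for p in sorted(set(positions)))
    return [
        f"select {selection}, resi {resi}",
        f"color {color}, {selection}",
        f"show sticks, {selection}",
    ]


# Placeholder residue numbers, for illustration only.
for command in pymol_commands([315, 253, 315]):
    print(command)
```

The printed lines can be pasted into the PyMOL command line after loading a structure.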
In the results/final_public directory:
- BCR_ABL_mutation_recovery_fuson_mutated_pns_only.csv: raw logits for every possible mutation at all positions in BCR::ABL with known drug resistance mutations
- Supplementary Tables - BCR ABL Mutations.csv: supplementary table showing the calculations going into the hit rate for BCR::ABL
- EML4_ALK_mutation_recovery_fuson_mutated_pns_only.csv: raw logits for every possible mutation at all positions in EML4::ALK with known drug resistance mutations
- Supplementary Tables - EML4 ALK Mutations.csv: supplementary table showing the calculations going into the hit rate for EML4::ALK
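To make the idea of a hit rate over per-position logits concrete, here is a minimal sketch. It assumes a position counts as a hit when at least one known resistance residue appears among the top n non-wild-type residues ranked by logit; this definition, the function names, and the toy data are assumptions for illustration. The calculation actually used is the one documented in the supplementary tables.

```python
def top_n_mutations(logits, wild_type, n=3):
    """Return the n highest-scoring residues at one position, excluding the wild type."""
    candidates = [(aa, score) for aa, score in logits.items() if aa != wild_type]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [aa for aa, _ in candidates[:n]]


def hit_rate(position_logits, wild_types, known_mutations, n=3):
    """Fraction of positions where a known mutant residue is in the top n (assumed definition)."""
    hits = 0
    for position, mutants in known_mutations.items():
        predicted = top_n_mutations(position_logits[position], wild_types[position], n)
        if any(mutant in predicted for mutant in mutants):
            hits += 1
    return hits / len(known_mutations)


# Toy logits for two positions; values and residues are made up.
toy_logits = {
    1: {"T": 5.0, "I": 4.0, "A": 1.0, "S": 0.5, "G": 0.1},
    2: {"Y": 3.0, "F": 2.0, "W": 1.5, "L": 1.0, "H": 0.2},
}
toy_wild_types = {1: "T", 2: "Y"}
toy_known = {1: {"I"}, 2: {"H"}}

print(hit_rate(toy_logits, toy_wild_types, toy_known, n=3))
```

In the toy data, position 1 is a hit (I ranks first among non-wild-type residues) and position 2 is a miss (H ranks fourth), so half of the positions are recovered.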
Discovery
The mutation_prediction/discovery/ directory contains all code and files needed to reproduce Fig. 5B and 5D, where mutations are predicted at each position in several fusion oncoproteins.
Data download
To help select TF- and kinase-containing fusions for investigation (Fig. 5B), Supplementary Table 3 from Salokas et al. 2020 was downloaded as a reference list of transcription factors and kinases. This data is processed in clean.py.
benchmarking/
└── mutation_prediction/
    └── discovery/
        ├── raw_data/
        │   └── salokas_2020_tableS3.csv
        └── processed_data/
            ├── test_seqs_tftf_kk.csv
            └── domain_conservation_fusions_inputfile.csv
- raw_data/salokas_2020_tableS3.csv: Supplementary Table 3 from Salokas et al. 2020
- processed_data/test_seqs_tftf_kk.csv: fusion oncoproteins in the FusOn-pLM test set that have either a transcription factor (TF) or a kinase as either the head or tail.
- processed_data/domain_conservation_fusions_inputfile.csv: input file for discovery, containing the longest sequences of EWSR1::FLI1, PAX3::FOXO1, TRIM24::RET, and ETV6::NTRK3.
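The selection criterion for test_seqs_tftf_kk.csv (a TF or a kinase as head or tail) can be sketched as a simple membership check. This is not the code in clean.py: the function name and the tiny reference sets below are hypothetical, and the real reference lists come from the Salokas et al. 2020 table.

```python
def has_tf_or_kinase(fusion_name, tfs, kinases):
    """True if the head or tail gene of a HEAD::TAIL fusion is a listed TF or kinase."""
    parts = fusion_name.split("::")
    if len(parts) != 2:
        raise ValueError(f"expected HEAD::TAIL, got {fusion_name!r}")
    head, tail = parts
    reference = set(tfs) | set(kinases)
    return head in reference or tail in reference


# Toy reference sets, for illustration only.
toy_tfs = {"FLI1", "FOXO1"}
toy_kinases = {"RET", "NTRK3"}

fusions = ["EWSR1::FLI1", "TRIM24::RET", "GENEA::GENEB"]
selected = [f for f in fusions if has_tf_or_kinase(f, toy_tfs, toy_kinases)]
print(selected)
```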
Mutation discovery
Run discover.py to perform discovery on the sequences specified in config.py (either one or many):
# Model settings: where is the model you wish to use for mutation discovery?
FUSON_PLM_CKPT = "FusOn-pLM"
#### Fill in this section if you have one input
# Sequence settings: need full sequence of fusion oncoprotein, and the bounds of region of interest
FULL_FUSION_SEQUENCE = ""
FUSION_NAME = "fusion_example"
START_RESIDUE_INDEX = 1
END_RESIDUE_INDEX = 100
N = 3 # number of mutations to predict per amino acid
#### Fill in this section if you have multiple inputs
PATH_TO_INPUT_FILE = "processed_data/domain_conservation_fusions_inputfile.csv" # if you don't have an input file and want to do one sequence, set this variable to None
# GPU Settings: which GPUs should be available to run this discovery?
CUDA_VISIBLE_DEVICES = "0"
To run, use:
nohup python discover.py > discover.out 2> discover.err &
- All results are stored in mutation_prediction/results/<timestamp>, where timestamp is a unique string encoding the date and time when you started the run.
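The single-input versus multiple-input switch in config.py can be sketched as follows. This is a hedged illustration, not the logic in discover.py: the function names and the CSV column names (fusion_name, sequence, start, end) are hypothetical, and the residue bounds are assumed to be 1-based and inclusive.

```python
import csv


def slice_region(sequence, start, end):
    """Return residues start..end, assuming 1-based inclusive bounds."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(
            f"region {start}-{end} is outside a sequence of length {len(sequence)}"
        )
    return sequence[start - 1:end]


def collect_inputs(path_to_input_file, fusion_name, full_sequence, start, end):
    """Return (name, region) pairs from the single-input settings or an input file."""
    if path_to_input_file is None:
        # Single-input mode: use the sequence settings directly.
        return [(fusion_name, slice_region(full_sequence, start, end))]
    inputs = []
    with open(path_to_input_file, newline="") as handle:
        # Column names here are assumed for illustration.
        for row in csv.DictReader(handle):
            region = slice_region(row["sequence"], int(row["start"]), int(row["end"]))
            inputs.append((row["fusion_name"], region))
    return inputs


# Single-input example with a made-up sequence.
print(collect_inputs(None, "fusion_example", "MKTAYIAK", 1, 3))
```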
Below are the FusOn-pLM paper results in results/final:
benchmarking/
└── mutation_prediction/
    └── discovery/
        └── results/final/
            ├── EWSR1::FLI1/
            │   ├── conservation_heatmap.png
            │   ├── full_results_with_logits.csv
            │   ├── predicted_tokens.csv
            │   └── raw_mutation_results.pkl
            ├── PAX3::FOXO1/    # same format as results/final/EWSR1::FLI1...
            ├── TRIM24::RET/    # same format as results/final/EWSR1::FLI1...
            └── ETV6::NTRK3/    # same format as results/final/EWSR1::FLI1...
In each fusion oncoprotein folder are:
- conservation_heatmap.png: the heatmap from Fig. 5B
- full_results_with_logits.csv: predictions for every residue in the sequence. Columns = "Residue", "original_residue", "original_residue_logit", "all_logits", "top_3_mutations"
- predicted_tokens.csv: simplified format, top three tokens per residue and 1/0 conserved/not conserved label. Columns = "Original Residue", "Predicted Residues", "Conserved", "Position"
- raw_mutation_results.pkl: raw logits in dictionary format.
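A file with the predicted_tokens.csv columns listed above can be summarized with the standard library alone. The sketch below uses made-up rows; it assumes "Conserved" is stored as 1 or 0 and "Position" as an integer, and the formatting of the "Predicted Residues" cell is a guess, so check a real file before relying on it.

```python
import csv
import io


def summarize_conservation(handle):
    """Return (fraction of conserved residues, positions labeled not conserved)."""
    rows = list(csv.DictReader(handle))
    conserved = [int(row["Conserved"]) for row in rows]
    variable_positions = [
        int(row["Position"]) for row, flag in zip(rows, conserved) if flag == 0
    ]
    fraction = sum(conserved) / len(rows) if rows else 0.0
    return fraction, variable_positions


# Made-up rows; the "Predicted Residues" cell format is assumed.
toy_file = io.StringIO(
    'Original Residue,Predicted Residues,Conserved,Position\n'
    'M,"M,L,V",1,1\n'
    'S,"A,T,G",0,2\n'
    'K,"K,R,Q",1,3\n'
    'P,"P,A,S",1,4\n'
)

fraction, variable_positions = summarize_conservation(toy_file)
print(fraction, variable_positions)
```

To apply the same function to a real result, pass an open file handle for one of the predicted_tokens.csv files instead of the in-memory toy data.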
Plotting
Three scripts aid with plotting:
- make_color_bar.py: can be run to generate viridis_color_bar.png, a labeled and scaled-up version of the color bar used in the conservation heatmaps and to color an ETV6::NTRK3 structure (Fig. 5D)
- plot.py: includes code for heatmap generation. Can be run to regenerate heatmaps.
- color_discovered_mutations.ipynb: notebook used to generate a modified version of the ETV6::NTRK3 structure file, which has the logits predicted by FusOn-pLM as the B-factor (processed_data/521_logit_bfactor.cif, displayed in Fig. 5D, right). Also has PyMOL code for a head/tail visualization with recovered drug resistance mutations (displayed in Fig. 5D, left)