
Mutation Prediction benchmark

This folder contains all the data and code needed to perform the mutation prediction benchmark (Figure 5).

Recovery

In Fig. 5C, drug resistance mutations in BCR::ABL and EML4::ALK are recovered by FusOn-pLM. Since the full sequences are not publicly available, certain sections of the code and results files have been removed. All results for drug resistance mutation positions in each sequence are included in mutation_prediction/recovery.

benchmarking/
└── mutation_prediction/ 
    └── recovery/ 
        └── results/final_public/
            ├── BCR_ABL_mutation_recovery_fuson_mutated_pns_only.csv
            ├── Supplementary Tables - BCR ABL Mutations.csv
            ├── EML4_ALK_mutation_recovery_fuson_mutated_pns_only.csv
            ├── Supplementary Tables - EML4 ALK Mutations.csv
        ├── abl_mutations.csv
        ├── alk_mutations.csv
        ├── color_recovered_mutations_public.ipynb
        ├── recover_public.py
        ├── config.py

In the 📁 recovery directory:

  • abl_mutations.csv: raw data from the literature with BCR::ABL mutations (O'Hare et al. 2007)
  • alk_mutations.csv: raw data from the literature with EML4::ALK mutations (Elshatlawy et al. 2023)
  • color_recovered_mutations_public.ipynb: notebook used to write PyMOL code for the visualizations in Fig. 5C
  • recover_public.py: Python script that runs the analysis. Since private sequences are removed, this script will not run, but it includes all steps taken.
  • config.py: small config file to specify which FusOn-pLM checkpoint is being used.
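
The PyMOL code written by the notebook is not reproduced in this README. As a rough sketch of the idea only, the snippet below builds PyMOL `select` and `color` commands for a list of residue positions. The function name, selection name, color, and positions are hypothetical placeholders, not values taken from the notebook.

```python
def pymol_color_commands(positions, selection_name="recovered", color="red"):
    """Return PyMOL commands that select and color the given residue numbers."""
    if not positions:
        return []
    # PyMOL accepts several residue numbers joined by '+', e.g. "resi 10+25+31"
    resi = "+".join(str(p) for p in sorted(set(positions)))
    return [
        f"select {selection_name}, resi {resi}",
        f"color {color}, {selection_name}",
    ]


# Hypothetical positions, for illustration only
commands = pymol_color_commands([25, 10, 31, 10])
print("\n".join(commands))
```

The printed lines can be pasted into the PyMOL command line after loading a structure.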

In the 📁 results/final_public directory:

  • BCR_ABL_mutation_recovery_fuson_mutated_pns_only.csv: raw logits for every possible mutation at all positions in BCR::ABL with known drug resistance mutations
  • Supplementary Tables - BCR ABL Mutations.csv: supplementary table showing the calculations going into the hit rate for BCR::ABL
  • EML4_ALK_mutation_recovery_fuson_mutated_pns_only.csv: raw logits for every possible mutation at all positions in EML4::ALK with known drug resistance mutations
  • Supplementary Tables - EML4 ALK Mutations.csv: supplementary table showing the calculations going into the hit rate for EML4::ALK
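
The exact hit rate calculation is given in the supplementary tables. As an illustration only, the sketch below assumes one plausible definition: a known resistance mutation counts as a hit when its mutant residue is among the `top_n` highest-logit residues at that position. The function name, the `top_n` parameter, the toy logits, and the in-memory layout are all assumptions and do not reflect the CSV column layout.

```python
def hit_rate(logits_by_position, known_mutations, top_n=3):
    """Fraction of known mutations whose mutant residue ranks in the top_n logits."""
    if not known_mutations:
        return 0.0
    hits = 0
    for position, mutant in known_mutations:
        logits = logits_by_position[position]
        # Rank residues at this position from highest to lowest logit
        ranked = sorted(logits, key=logits.get, reverse=True)
        if mutant in ranked[:top_n]:
            hits += 1
    return hits / len(known_mutations)


# Toy logits (hypothetical values, 4-letter alphabet for brevity)
toy_logits = {
    12: {"A": 2.0, "I": 1.5, "T": 0.3, "V": -1.0},
    40: {"A": -0.5, "I": 0.1, "T": 3.0, "V": 2.2},
}
toy_known = [(12, "I"), (40, "A")]  # (position, mutant residue)
rate = hit_rate(toy_logits, toy_known, top_n=2)
print(rate)
```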

Discovery

The mutation_prediction/discovery/ directory contains all code and files needed to reproduce Fig. 5B and 5D, where mutations are predicted at each position in several fusion oncoproteins.

Data download

To help select TF- and kinase-containing fusions for investigation (Fig. 5B), Supplementary Table 3 from Salokas et al. 2020 was downloaded as a reference list of transcription factors and kinases. clean.py processes this data.

benchmarking/
└── mutation_prediction/ 
    └── discovery/ 
        └── raw_data/
            ├── salokas_2020_tableS3.csv
        └── processed_data/
            ├── test_seqs_tftf_kk.csv
            ├── domain_conservation_fusions_inputfile.csv

  • raw_data/salokas_2020_tableS3.csv: Supplementary Table 3 from Salokas et al. 2020
  • processed_data/test_seqs_tftf_kk.csv: fusion oncoproteins in the FusOn-pLM test set that have a transcription factor (TF) or a kinase as the head or tail partner.
  • processed_data/domain_conservation_fusions_inputfile.csv: input file for discovery, containing the longest sequences of EWSR1::FLI1, PAX3::FOXO1, TRIM24::RET, and ETV6::NTRK3.
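
The selection step can be pictured roughly as follows. This is a minimal sketch, not the code in clean.py: it assumes fusion names follow the `HEAD::TAIL` convention used in this README and that the reference table has already been reduced to two sets of gene names. All function names and gene names below are hypothetical.

```python
def classify_partner(gene, tfs, kinases):
    """Label a fusion partner as 'TF', 'kinase', or None."""
    if gene in tfs:
        return "TF"
    if gene in kinases:
        return "kinase"
    return None


def select_tf_kinase_fusions(fusion_names, tfs, kinases):
    """Keep fusions whose head or tail is a TF or a kinase."""
    selected = []
    for name in fusion_names:
        head, tail = name.split("::", 1)
        head_label = classify_partner(head, tfs, kinases)
        tail_label = classify_partner(tail, tfs, kinases)
        if head_label or tail_label:
            selected.append((name, head_label, tail_label))
    return selected


# Placeholder gene names, for illustration only
toy_tfs = {"TFA"}
toy_kinases = {"KINB"}
toy_fusions = ["TFA::OTHERC", "OTHERC::KINB", "OTHERC::OTHERD"]
selected = select_tf_kinase_fusions(toy_fusions, toy_tfs, toy_kinases)
print(selected)
```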

Mutation discovery

Run discover.py to perform discovery on the sequences specified in config.py (either one or many):

# Model settings: where is the model you wish to use for mutation discovery?
FUSON_PLM_CKPT = "FusOn-pLM"

#### Fill in this section if you have one input
# Sequence settings: need full sequence of fusion oncoprotein, and the bounds of region of interest
FULL_FUSION_SEQUENCE = ""
FUSION_NAME = "fusion_example"
START_RESIDUE_INDEX = 1
END_RESIDUE_INDEX = 100
N = 3           # number of mutations to predict per amino acid

#### Fill in this section if you have multiple inputs
PATH_TO_INPUT_FILE = "processed_data/domain_conservation_fusions_inputfile.csv"   # if you don't have an input file and want to do one sequence, set this variable to None

# GPU Settings: which GPUs should be available to run this discovery? 
CUDA_VISIBLE_DEVICES = "0"
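
As a rough illustration of how these settings might be interpreted, the sketch below picks single-sequence or multi-sequence mode from `PATH_TO_INPUT_FILE` and extracts the region of interest. It assumes the residue indices are 1-based and inclusive, which the default `START_RESIDUE_INDEX = 1` suggests but the config does not state. The function names and the toy sequence are hypothetical and not taken from discover.py.

```python
def input_mode(path_to_input_file):
    """Return 'single' when no input file is given, otherwise 'multiple'."""
    return "single" if path_to_input_file is None else "multiple"


def region_of_interest(sequence, start_residue_index, end_residue_index):
    """Return residues start..end of the sequence (assumed 1-based, inclusive)."""
    if not sequence:
        raise ValueError("FULL_FUSION_SEQUENCE is empty")
    if not (1 <= start_residue_index <= end_residue_index <= len(sequence)):
        raise ValueError("residue bounds fall outside the sequence")
    # Convert the 1-based inclusive bounds to a Python slice
    return sequence[start_residue_index - 1:end_residue_index]


# Arbitrary toy sequence, for illustration only
toy_sequence = "MKTAYIAKQR"
mode = input_mode(None)
region = region_of_interest(toy_sequence, 2, 4)
print(mode, region)
```

Checking the bounds before running the model avoids a long job failing on a mistyped index.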

To run, use:

nohup python discover.py > discover.out 2> discover.err &
  • All results are stored in mutation_prediction/results/<timestamp>, where timestamp is a unique string encoding the date and time when the run started.

Below are the FusOn-pLM paper results in results/final:

benchmarking/
└── mutation_prediction/ 
    └── discovery/ 
        └── results/final/
            └── EWSR1::FLI1/
                ├── conservation_heatmap.png
                ├── full_results_with_logits.csv
                ├── predicted_tokens.csv
                ├── raw_mutation_results.pkl
            └── PAX3::FOXO1/        # same format as results/final/EWSR1::FLI1...
            └── TRIM24::RET/        # same format as results/final/EWSR1::FLI1...
            └── ETV6::NTRK3/        # same format as results/final/EWSR1::FLI1...

In each fusion oncoprotein folder are:

  • conservation_heatmap.png: the heatmap from Fig. 5B
  • full_results_with_logits.csv: predictions for every residue in the sequence. Columns = "Residue", "original_residue", "original_residue_logit", "all_logits", "top_3_mutations"
  • predicted_tokens.csv: simplified format, top three tokens per residue and 1/0 conserved/not conserved label. Columns = "Original Residue", "Predicted Residues", "Conserved", "Position"
  • raw_mutation_results.pkl: raw logits in dictionary format.
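
To make the relationship between the raw logits and the simplified table concrete, the sketch below derives one row with the predicted_tokens.csv columns from a per-position logit dictionary. It assumes a position is labeled conserved (1) when the original residue appears among the top `n` predicted tokens. That rule, the function names, and the toy logits are assumptions for illustration and may differ from what discover.py does.

```python
def top_n_tokens(logits, n=3):
    """Return the n residues with the highest logits, best first."""
    return sorted(logits, key=logits.get, reverse=True)[:n]


def summarize_position(position, original_residue, logits, n=3):
    """Build one row using the column names of predicted_tokens.csv."""
    predicted = top_n_tokens(logits, n)
    return {
        "Original Residue": original_residue,
        "Predicted Residues": "".join(predicted),
        # Assumed rule: conserved if the original residue is in the top n
        "Conserved": int(original_residue in predicted),
        "Position": position,
    }


# Toy logits (hypothetical values, 4-letter alphabet for brevity)
toy_position_logits = {"A": 0.2, "T": 2.5, "I": 1.1, "V": -0.4}
row = summarize_position(7, "T", toy_position_logits, n=3)
print(row)
```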

Plotting

Three scripts aid with plotting:

  1. make_color_bar.py: can be run to generate viridis_color_bar.png, a labeled and scaled up version of the color bar used in the conservation heatmaps and to color an ETV6::NTRK3 structure (Fig. 5D)
  2. plot.py: includes code for heatmap generation. Can be run to regenerate heatmaps.
  3. color_discovered_mutations.ipynb: notebook used to generate a modified version of the ETV6::NTRK3 structure file, which stores the logits predicted by FusOn-pLM in the B-factor column (processed_data/521_logit_bfactor.cif, displayed in Fig. 5D, right). It also contains PyMOL code for a head/tail visualization with recovered drug resistance mutations (displayed in Fig. 5D, left).
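
Storing a per-residue value in the B-factor column lets a structure viewer color the structure by that value. The sketch below shows the idea on fixed-column PDB `ATOM` records, where the B-factor occupies columns 61-66 and the residue number columns 23-26. Note the swap: the repository file is mmCIF, and this sketch uses PDB format only because it is simpler to show. The function names, the toy atoms, and the logit value are hypothetical.

```python
def residue_number(atom_line):
    """Residue sequence number from PDB columns 23-26."""
    return int(atom_line[22:26])


def with_bfactor(atom_line, value):
    """Return the record with columns 61-66 (B-factor) replaced by value."""
    padded = atom_line.ljust(66)
    return padded[:60] + f"{value:6.2f}" + padded[66:]


def write_logits_as_bfactors(pdb_lines, logit_by_residue):
    """Copy the lines, overwriting B-factors for residues that have a logit."""
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            resnum = residue_number(line)
            if resnum in logit_by_residue:
                line = with_bfactor(line, logit_by_residue[resnum])
        out.append(line)
    return out


def atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a toy ATOM record with occupancy 1.00 and B-factor 0.00."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}")


# Two toy atoms; only residue 2 receives a (hypothetical) logit
toy_structure = [
    atom_line(1, " CA ", "MET", "A", 1, 0.0, 0.0, 0.0),
    atom_line(2, " CA ", "LYS", "A", 2, 3.8, 0.0, 0.0),
]
updated = write_logits_as_bfactors(toy_structure, {2: 3.25})
print("\n".join(updated))
```

Values wider than six characters (for example, above 999.99) would not fit the fixed-width column and would need rescaling first.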