CAID Benchmark

This folder contains all the data and code needed to run the CAID benchmark, in which FusOn-pLM-Diso (a classifier built on FusOn-pLM embeddings) predicts per-residue disorder propensities (Figure 4C-F), and to plot disorder properties (Figure 1C-1D, S1).

TL;DR

The order in which to run the scripts:

python scrape_fusionpdb.py              # pull FusionPDB structures
python process_fusion_structures.py     # process FusionPDB structures, and head/tail protein structures
python clean.py                         # clean disorder data and structure data. Assemble train/test/benchmark splits
python train.py                         # train models 
python analyze_fusion_preds.py          # make box chart and line plot of model performance on fusion proteins
python plot.py                          # plot AUROC of model performance, and additional figures based on disorder data

Additional notes:

color_disorder_residues.ipynb is used to plot fusion structures with pLDDT or disorder prediction color overlays.
We recommend using nohup to run longer scripts such as scrape_fusionpdb.py, process_fusion_structures.py, clean.py, and train.py.
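
The run order in the TL;DR can also be captured in a small driver script. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes all six scripts sit in the current working directory.

```python
import subprocess
import sys

# Script order taken from the TL;DR above.
PIPELINE = [
    "scrape_fusionpdb.py",
    "process_fusion_structures.py",
    "clean.py",
    "train.py",
    "analyze_fusion_preds.py",
    "plot.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one argv list per script, in run order."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run each command in turn; check=True stops at the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_commands())` would execute the full pipeline sequentially. Because several steps are long-running, launching the driver itself under nohup is likely the more practical choice.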

Downloading raw disorder data

Per-residue disorder predictions were used to train and test FusOn-pLM-Diso.

flDPnn (Hu et al. 2021)
1. At this link, scroll down to the bottom to find links to the training and validation sets.
IDP-CRF (Liu et al. 2018)
1. Download zipped data from this link, remove header and footer, and save as a FASTA file.
CAID2-Disorder-NOX (Del Conte et al. 2023)
1. Go to CAID Round 2 Results. Scroll to "Here you can download the references used in the CAID-2 challenge" and you'll find the following links.
  1. disorder_nox.fasta
  2. predictions made by all CAID2 participants; AUROC curves can be reconstructed from these

Raw disorder data are stored in caid/raw_data

benchmarking/
└── caid/ 
    └── raw_data/ 
        └── caid2_competition_results/...
        └── caid2_train_and_test_data/
            ├── CAID-2_Disorder_NOX_Testing_Sequences.fasta
            ├── flDPnn_Training_Dataset.txt
            ├── flDPnn_Validation_Annotation.txt
            ├── IDP-CRF_Training_Dataset.txt

📁 raw_data/caid2_competition_results/: folder containing raw predictions from CAID2 competitors, downloaded directly from the CAID2 website. Models: AlphaFold-disorder, AlphaFold-rsa, DeepIDP-2L, disomine, DisoPred, DISOPRED3-diso, Dispredict3, ESpritz-D, flDPlr2, flDPnn, flDPnn2, flDPtr, IDP-Fusion, IUPred3.
raw_data/caid2_train_and_test_data/CAID-2_Disorder_NOX_Testing_Sequences.fasta: Disorder-NOX dataset (used as the test set in this benchmark)
raw_data/caid2_train_and_test_data/flDPnn_Training_Dataset.txt: training set for flDPnn
raw_data/caid2_train_and_test_data/flDPnn_Validation_Dataset.txt: validation set for flDPnn
raw_data/caid2_train_and_test_data/IDP-CRF_Training_Dataset.txt: training set for IDP-CRF

Processing disorder data

benchmarking/
└── caid/ 
    └── processed_data/
        └── caid2_competition_results/...
        ├── CAID-2_Disorder_NOX_Processed.csv
        ├── flDPnn_Training_Dataset.csv
        ├── flDPnn_Validation_Dataset.csv
        ├── IDP-CRF_Training_Dataset.csv
    └── splits/
        ├── splits.csv
        ├── train_df.csv 
        ├── test_df.csv
        ├── fusion_bench_df.csv

clean.py processes and combines the raw data files, generating the following files in 📁processed_data/:

📁 caid2_competition_results/: a folder with table versions of all the files in 📁 raw_data/caid2_competition_results/
CAID-2_Disorder_NOX_Processed.csv: a table of test data, made by parsing raw_data/caid2_train_and_test_data/CAID-2_Disorder_NOX_Testing_Sequences.fasta
flDPnn_Training_Dataset.csv: a table of flDPnn's training data, made by parsing raw_data/caid2_train_and_test_data/flDPnn_Training_Dataset.txt
flDPnn_Validation_Dataset.csv: a table of flDPnn's validation data, made by parsing raw_data/caid2_train_and_test_data/flDPnn_Validation_Dataset.txt
IDP-CRF_Training_Dataset.csv: a table of IDP-CRF's training data, made by parsing raw_data/caid2_train_and_test_data/IDP-CRF_Training_Dataset.txt
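
As an illustration of the parsing step, here is a minimal sketch that turns an annotated FASTA-style file into a table. It assumes each record spans exactly three lines (a `>` header, the sequence, and a per-residue label string of the same length); the raw files may use a different layout, and the column names `ID`, `Sequence`, and `Label` are hypothetical rather than those written by clean.py.

```python
import csv
import io


def parse_annotated_fasta(text):
    """Parse records of the assumed form: '>id', sequence, label string."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per record")
    records = []
    for i in range(0, len(lines), 3):
        header, seq, labels = lines[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"bad header: {header!r}")
        if len(seq) != len(labels):
            raise ValueError(f"length mismatch for {header[1:]}")
        records.append({"ID": header[1:], "Sequence": seq, "Label": labels})
    return records


def records_to_csv(records):
    """Serialize parsed records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["ID", "Sequence", "Label"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

The length check is the useful part of the sketch: a label string that does not line up with its sequence would silently corrupt per-residue training targets.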

clean.py also generates the final train-test splits and the fusion oncoprotein benchmarking file used to train and evaluate the disorder predictors. These are stored in 📁splits/:

splits.csv: sequences, IDs, split (either "Train", "Test", or "Fusion_Benchmark"), and per-residue disorder labels based on AlphaFold pLDDT (1 (disordered) if pLDDT < 68.8, 0 (ordered) if pLDDT >= 68.8)
train_df.csv: just the Train set portion of splits.csv
test_df.csv: just the Test set portion of splits.csv
fusion_bench_df.csv: just the Fusion_Benchmark portion of splits.csv. Includes 524 fusion oncoproteins from the FusOn-pLM test set whose structures were collected from FusionPDB (see "Downloading and Processing FusionPDB data").
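
The pLDDT-based labeling rule described for splits.csv can be sketched in a few lines. The function name and the choice of a list of integers as the return type are assumptions for illustration; only the 68.8 cutoff and the direction of the comparison come from the description above.

```python
PLDDT_CUTOFF = 68.8  # cutoff stated in the splits.csv description


def plddt_to_labels(plddts, cutoff=PLDDT_CUTOFF):
    """Label each residue 1 (disordered) if pLDDT < cutoff, else 0 (ordered)."""
    return [1 if p < cutoff else 0 for p in plddts]
```

For example, `plddt_to_labels([92.1, 68.8, 68.7, 31.0])` gives `[0, 0, 1, 1]`: a residue exactly at the cutoff counts as ordered.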

Downloading and Processing FusionPDB data

The structures of fusion oncoproteins from the FusionPDB database were used to evaluate FusOn-pLM-Diso's performance on fusion oncoproteins. This data was collected by running scrape_fusionpdb.py, followed by process_fusion_structures.py. These scripts populated the raw_data and processed_data files simultaneously.

Listed below are all the relevant files:

benchmarking/
└── caid/ 
    └── raw_data/ 
        └── fusionpdb/
            └── structures/... # created by scrape_fusionpdb.py (folder not included in repo)
            └── head_tail_af2db_structures/... # created by process_fusion_structures.py (folder not included in repo)
            ├── FusionPDB_level2_curated_09_05_2024.csv
            ├── FusionPDB_level2_fusion_structure_links.csv
            ├── FusionPDB_level3_curated_09_05_2024.csv
            ├── FusionPDB_level3_fusion_structure_links.csv
            ├── fusionpdb_structureless_ids.txt
            ├── hgene_tgene_uniprot_idmap_07_10_2024.txt
            ├── level2_head_tail_info.txt
            ├── level3_head_tail_info.txt
            ├── not_in_afdb_idmap.txt
    └── processed_data/
        └── fusionpdb/
            └── intermediates/
                ├── giant_level2-3_fusion_protein_head_tail_info.csv
                ├── giant_level2-3_fusion_protein_structure_links.csv
                ├── giant_level2-3_fusion_protein_structures_processed.csv
                ├── uniprotids_not_in_afdb.txt
                ├── unmapped_parts.txt
            ├── fusion_heads_and_tails.csv
            ├── FusionPDB_level2-3_cleaned_FusionGID_info.csv
            ├── FusionPDB_level2-3_cleaned_structure_info.csv
            ├── heads_tails_structural_data.csv

⚙️ Pipeline

Here we describe what each script does and which files each script creates.

🐍 scrape_fusionpdb.py
  i. Scrapes metadata for FusionPDB Level 2 and Level 3
    a. Pulls the online tables for Level 2 and Level 3, saving results to raw_data/FusionPDB_level2_curated_09_05_2024.csv and raw_data/FusionPDB_level3_curated_09_05_2024.csv, respectively.
  ii. Retrieves structure links
    a. Using the tables collected in step (i), visits the page for each fusion oncoprotein (FO) in FusionPDB Level 2 and 3, and downloads all AlphaFold2 structure links for each FO.
    b. Saves results directly to raw_data/FusionPDB_level2_fusion_structure_links.csv and raw_data/FusionPDB_level3_fusion_structure_links.csv, respectively.
  iii. Retrieves FO head gene and tail gene info
    a. Using the tables collected in step (i), visits the page for each FO in FusionPDB Level 2 and 3 to download head/tail info. Collects HGID and TGID (GeneIDs for head and tail) and UniProt accessions for each.
    b. Saves results directly to raw_data/level2_head_tail_info.txt and raw_data/level3_head_tail_info.txt, respectively.
  iv. Combines Level 2 and 3 head/tail data
    a. Merges raw_data/level2_head_tail_info.txt and raw_data/level3_head_tail_info.txt into a dataframe.
    b. Saves result at processed_data/fusionpdb/fusion_heads_and_tails.csv (columns="FusionGID","HGID","TGID","HGUniProtAcc","TGUniProtAcc").
  v. Combines Level 2 and 3 structure link data
    a. Joins structure link data with metadata for each of levels 2 and 3, then combines the result.
    b. Saves result at processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structure_links.csv.
  vi. Combines structure link data and metadata (result of step (v)) with head and tail data (result of step (iv)), and resolves any missing head/tail UniProt IDs
    a. Merges the data.
    b. Checks how many rows have either missing or wrong UniProt accessions for the head or tail gene, and compiles the gene symbols for online querying in the UniProt ID Mapping tool (processed_data/fusionpdb/intermediates/unmapped_parts.txt).
    c. Reads the UniProt ID Mapping result. Combines this data with FusionPDB-scraped data by matching FusionPDB's HGID (GeneID for head) and TGID (GeneID for tail) with the GeneID returned by UniProt.
    d. For any FO where FusionPDB lacked a UniProt ID for the head/tail, this ID is filled in from the UniProt ID Mapping result.
    e. Saves result to processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_head_tail_info.csv. Columns: "FusionGID","FusionGene","Hgene","Tgene","URL","HGID","TGID","HGUniProtAcc","TGUniProtAcc","HGUniProtAcc_Source","TGUniProtAcc_Source", where the "_Source" columns indicate whether the UniProt ID came from FusionPDB or from the ID Map.
  vii. Downloads AlphaFold2 structures of FOs from FusionPDB
    a. Using structure links from processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structure_links.csv (step (v)), directly downloads .pdb and .cif files.
    b. Saves results in 📁raw_data/fusionpdb/structures.
🐍 process_fusion_structures.py
  i. Determines pLDDT(s) for each FO structure
    a. For each structure in 📁raw_data/fusionpdb/structures/, determines amino acid sequence, per-residue pLDDT, and average pLDDT from the AlphaFold2 structure.
    b. Saves results in processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structures_processed.csv.
  ii. Downloads AlphaFold2 structures for all head and tail proteins
    a. Reads processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_head_tail_info.csv and collects all unique UniProt IDs for all head/tail proteins.
    b. For each UniProt ID, queries the AlphaFoldDB, downloads the AlphaFold2 structure (if available), and saves it to 📁raw_data/fusionpdb/head_tail_af2db_structures/. Saves files converted from PDB to CIF format in mmcif_converted_files. Then, extracts the sequence, per-residue pLDDT, and average pLDDT from the file.
    c. Saves any UniProt IDs that did not have structures in the AlphaFoldDB to processed_data/fusionpdb/intermediates/uniprotids_not_in_afdb.txt. Most of these were very long, but the shorter ones were folded and their average pLDDTs were entered manually. These were put back into the AlphaFold ID map to look for alternative UniProt IDs, and their results are in not_in_afdb_idmap.txt.
    d. Saves results to processed_data/fusionpdb/heads_tails_structural_data.csv.
  iii. Cleans the database of level 2 & 3 structural info
    a. Drops rows where no structure was successfully downloaded.
    b. Drops rows where the FO sequence from FusionPDB does not match the FO sequence from its own AlphaFold2 structure file.
    c. ⭐️Saves two final, cleaned databases⭐️:
      a. ⭐️ FusionPDB_level2-3_cleaned_FusionGID_info.csv: includes full IDs and structural information for the Hgene and Tgene of each FO. Columns = "FusionGID", "FusionGene", "Hgene", "Tgene", "URL", "HGID", "TGID", "HGUniProtAcc", "TGUniProtAcc", "HGUniProtAcc_Source", "TGUniProtAcc_Source", "HG_pLDDT", "HG_AA_pLDDTs", "HG_Seq", "TG_pLDDT", "TG_AA_pLDDTs", "TG_Seq".
      b. ⭐️ FusionPDB_level2-3_cleaned_structure_info.csv: includes full structural information for each FO. Columns = "FusionGID", "FusionGene", "Fusion_Seq", "Fusion_Length", "Hgene", "Hchr", "Hbp", "Hstrand", "Tgene", "Tchr", "Tbp", "Tstrand", "Level", "Fusion_Structure_Link", "Fusion_Structure_Type", "Fusion_pLDDT", "Fusion_AA_pLDDTs", "Fusion_Seq_Source".
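
The pLDDT extraction in process_fusion_structures.py can be illustrated with a standard-library sketch. AlphaFold writes per-residue pLDDT into the B-factor field of its coordinate files, so reading that field from one atom per residue recovers the sequence, the per-residue pLDDTs, and their mean. This is not the repository's implementation: the function name is hypothetical, only the PDB format is handled (the pipeline also works with .cif files), and a single chain of standard residues is assumed.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def plddt_from_pdb(pdb_text):
    """Return (sequence, per-residue pLDDTs, mean pLDDT) from PDB text."""
    seq, plddts = [], []
    for line in pdb_text.splitlines():
        # Use one CA atom per residue; fields are fixed-width columns.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            seq.append(THREE_TO_ONE.get(line[17:20], "X"))  # residue name
            plddts.append(float(line[60:66]))  # B-factor column holds pLDDT
    if not plddts:
        raise ValueError("no CA atoms found")
    return "".join(seq), plddts, sum(plddts) / len(plddts)
```

The sequence recovered this way is also what step (iii) would compare against the FusionPDB sequence when dropping mismatched rows.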

Training

The model is defined in model.py and utils.py. Training configs can be provided in config.py:

# Which models to benchmark
BENCHMARK_FUSONPLM = True
# FUSONPLM_CKPTS: if you've trained your own model, this is a dictionary: key = run name, values = epochs
# To use the trained FusOn-pLM instead, set FUSONPLM_CKPTS = "FusOn-pLM"
FUSONPLM_CKPTS = "FusOn-pLM"

BENCHMARK_ESM = True

# GPU configs
CUDA_VISIBLE_DEVICES="0"

# Overwriting configs
PERMISSION_TO_OVERWRITE_EMBEDDINGS = False                     # if False, script will halt if it believes these embeddings have already been made. 
PERMISSION_TO_OVERWRITE_MODELS = False                          # if False, script will halt if it believes these models have already been trained.

train.py trains the models using embeddings indicated in config.py. It also performs a hyperparameter screen.

All results are stored in caid/results/<timestamp>, where timestamp is a unique string encoding the date and time when you started training.
All raw outputs from models are stored in caid/trained_models/<embedding_path>, where embedding_path represents the embeddings used to build the disorder predictor.
All embeddings made for training will be stored in a new folder called caid/embeddings/ with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
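
The overwrite flags in config.py act as a guard around cached embeddings and trained models. A minimal sketch of such a guard follows; the function name `ensure_can_write` is hypothetical, and how train.py actually decides that an output "already exists" is not documented here, so a simple path check is assumed.

```python
import os


def ensure_can_write(path, permission_to_overwrite):
    """Raise if `path` exists and overwriting has not been permitted."""
    if os.path.exists(path) and not permission_to_overwrite:
        raise FileExistsError(
            f"{path} already exists; set the overwrite flag to True to replace it"
        )
    return path
```

Halting rather than silently overwriting is what makes the cached caid/embeddings/ folder safe to reuse across runs.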

Below is the FusOn-pLM-Diso raw outputs folder, trained_models/fuson_plm/best/ (ESM-2-650M-Diso has a folder in the same format, and future trained models will as well):

benchmarking/
└── caid/ 
    └── trained_models/ 
        └── esm2_t33_650M_UR50D/best/
        └── fuson_plm/best/
            ├── caid_hyperparam_screen_fusion_benchmark_metrics.csv
            ├── caid_hyperparam_screen_fusion_benchmark_probs.csv
            ├── caid_hyperparam_screen_test_metrics.csv
            ├── caid_hyperparam_screen_test_probs.csv
            ├── caid_train_losses.csv
            ├── params.txt

caid_hyperparam_screen_fusion_benchmark_metrics.csv: performance metrics (Accuracy, Precision, Recall, F1 Score, AUROC) for the top model on the fusion benchmark set (splits/fusion_bench_df.csv)
caid_hyperparam_screen_fusion_benchmark_probs.csv: for the fusion benchmark, raw probabilities of class 1 (disorder), threshold used to assign 0/1 based on maximized F1 score, prediction labels based on probabilities and threshold
caid_hyperparam_screen_test_metrics.csv: same as caid_hyperparam_screen_fusion_benchmark_metrics.csv, but for CAID2 Disorder-NOX (splits/test_df.csv)
caid_hyperparam_screen_test_probs.csv: same as caid_hyperparam_screen_fusion_benchmark_probs.csv, but for CAID2 Disorder-NOX
caid_train_losses.csv: train losses over the 2 training epochs for top-performing model
params.txt: hyperparameters of top performing model
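
The probs files record a threshold chosen to maximize F1. One way to pick such a threshold is sketched below; it is an assumption that each observed probability is tried as a candidate cutoff and that a residue is called disordered when its probability is greater than or equal to the cutoff. The repository may search thresholds differently.

```python
def f1_score(labels, preds):
    """F1 for binary 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(labels, probs):
    """Try each observed probability as a cutoff; keep the best F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1
```

With labels `[0, 0, 1, 1]` and probabilities `[0.1, 0.4, 0.35, 0.8]`, the sketch selects a cutoff of 0.35, where F1 is 0.8.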

Results from the FusOn-pLM manuscript are found in results/final. A few extra data files and plots are added by analyze_fusion_preds.py.

benchmarking/
└── caid/ 
    └── results/final
        ├── best_caid_model_results.csv 
        ├── caid_hyperparam_screen_test_metrics.csv
        ├── caid_hyperparam_screen_fusion_benchmark_metrics.csv
        ├── caid_hyperparam_screen_train_losses.csv
        ├── fusion_disorder_boxplots.png
        ├── fusion_pred_disorder_r2.png
        ├── fusion_disorder_boxplots_source_data.csv
        ├── fusion_pred_disorder_r2_source_data.csv
        ├── CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png   
        ├── CAID_fpr_tpr_source_data.csv 
        ├── CAID_prediction_source_data.csv

best_caid_model_results.csv: Summary file of hyperparameters, test set statistics, and fusion benchmark statistics for the best model of each type screened (ESM-2-650M, FusOn-pLM)
caid_hyperparam_screen_fusion_benchmark_metrics.csv: Fusion benchmark set statistics for full hyperparameter screen
caid_hyperparam_screen_test_metrics.csv: Test set statistics for full hyperparameter screen
caid_hyperparam_screen_train_losses.csv: Train losses for full hyperparameter screen
📊 fusion_disorder_boxplots.png: Fig. 4E, left (data directly used to produce the plot at fusion_disorder_boxplots_source_data.csv)
📊 fusion_pred_disorder_r2.png: Fig. 4E, right (data directly used to produce the plot at fusion_pred_disorder_r2_source_data.csv)
📊 CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png: Fig. 4D (probabilities used at CAID_prediction_source_data.csv, FPR/TPR relationships directly used to make the plot at CAID_fpr_tpr_source_data.csv)
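
The relationship between the prediction probabilities and the FPR/TPR source data can be sketched with a standard-library ROC computation. This is an illustrative implementation, not the code behind the figure: the function names are hypothetical and labels are assumed to be integers 0 or 1.

```python
def roc_points(labels, probs):
    """Return (fpr, tpr) lists, sweeping the cutoff from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both classes to draw a ROC curve")
    ranked = sorted(zip(probs, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (prob, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != prob:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auroc(fpr, tpr):
    """Trapezoidal area under the (fpr, tpr) curve."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))
```

The same computation applied to each competitor's raw predictions in raw_data/caid2_competition_results/ is how their curves could be reconstructed for comparison.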

To run the training script, use

nohup python train.py > train.out 2> train.err &

Plotting

The plot.py script generates many figures from the paper, alongside the formatted data directly used for plotting.

benchmarking/
└── caid/ 
    └── results/final/
        ├── CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png
    └── processed_data/ 
        └── figures/ 
            └── fusion_disorder/ 
                ├── plddt_sequence_EML4-ALK.png 
                ├── plddt_sequence_EML4::ALK_source_data.csv
                ├── plddt_sequence_EWSR1-FLI1.png 
                ├── plddt_sequence_EWSR1::FLI1_source_data.csv
                ├── plddt_sequence_PAX3-FOXO1.png 
                ├── plddt_sequence_PAX3::FOXO1_source_data.csv
                ├── plddt_sequence_SS18-SSX1.png
                ├── plddt_sequence_SS18::SSX1_source_data.csv
            └── histograms/ 
                ├── disorder_nox_histogram.png 
                ├── disorder_nox_histogram_source_data.csv
                ├── fusions_histogram.png  
                ├── fusions_histogram_source_data.csv
                ├── heads_histogram.png  
                ├── heads_histogram_source_data.csv
                ├── tails_histogram.png
                ├── tails_histogram_source_data.csv

Plots in fusion_disorder are from Fig. 1C.
Plots in histograms are from Fig. 1D and Fig. S1.

To regenerate these plots and source data, run:

python plot.py

Colored structure images

color_disorder_residues.ipynb is used to plot fusion structures with pLDDT or disorder prediction color overlays. By running certain (or all) of its cells, you will recreate images from Fig. 1C and 4F, as well as the following file:

benchmarking/
└── caid/ 
    └── disorder_coloring_data
        ├── normalized_disorder_propensities_source_data.csv

normalized_disorder_propensities_source_data.csv: the normalized disorder propensities that were visualized on fusion structures in Fig. 4F