CAID Benchmark
This folder contains all the data and code needed to perform the CAID benchmark, where FusOn-pLM-Diso (a classifier built on FusOn-pLM embeddings) is used to predict per-residue disorder propensities (Figure 4C-F), and to plot disorder properties (Figure 1C-1D, S1).
TL;DR
The order in which to run the scripts:
python scrape_fusionpdb.py # pull FusionPDB structures
python process_fusion_structures.py # process FusionPDB structures, and head/tail protein structures
python clean.py # clean disorder data and structure data. Assemble train/test/benchmark splits
python train.py # train models
python analyze_fusion_preds.py # make box chart and line plot of model performance on fusion proteins
python plot.py # plot AUROC of model performance, and additional figures based on disorder data
Additional notes:
- color_disorder_residues.ipynb is used to plot fusion structures with pLDDT or disorder prediction color overlays.
- We recommend using nohup to run longer scripts like scrape_fusionpdb.py, process_fusion_structures.py, clean.py, and train.py.
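The run order above can also be driven from one Python script. This is only a minimal sketch: it assumes all six scripts sit in the current working directory, and the helper names build_commands and run_pipeline are hypothetical (they are not part of this repository).

```python
import subprocess
import sys

# Script order taken from the TL;DR list above.
PIPELINE = [
    "scrape_fusionpdb.py",
    "process_fusion_structures.py",
    "clean.py",
    "train.py",
    "analyze_fusion_preds.py",
    "plot.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one [interpreter, script] command per pipeline step."""
    return [[python, script] for script in scripts]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failing step."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)
```

Calling run_pipeline(build_commands()) would execute the steps one after another. Because a later step depends on files written by an earlier one, stopping at the first failure is the safer default.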
Downloading raw disorder data
Per-residue disorder predictions were used to train and test FusOn-pLM-Diso.
- flDPnn (Hu et al. 2021)
- At this link, scroll down to the bottom to find links to the training and validation sets.
- IDP-CRF (Liu et al. 2018)
- Download zipped data from this link, remove header and footer, and save as a FASTA file
- CAID2-Disorder-NOX (Del Conte et al. 2023)
- Go to CAID Round 2 Results. Scroll to "Here you can download the references used in the CAID-2 challenge" and you'll find the following links.
- disorder_nox.fasta
- predictions made by all CAID2 participants; AUROC curves can be reconstructed from these
Raw disorder data are stored in caid/raw_data:
benchmarking/
└── caid/
    └── raw_data/
        ├── caid2_competition_results/...
        └── caid2_train_and_test_data/
            ├── CAID-2_Disorder_NOX_Testing_Sequences.fasta
            ├── flDPnn_Training_Dataset.txt
            ├── flDPnn_Validation_Annotation.txt
            └── IDP-CRF_Training_Dataset.txt
- raw_data/caid2_competition_results/: folder containing raw predictions from CAID2 competitors, downloaded directly from the CAID2 website. Models: AlphaFold-disorder, AlphaFold-rsa, DeepIDP-2L, disomine, DisoPred, DISOPRED3-diso, Dispredict3, ESpritz-D, flDPlr2, flDPnn, flDPnn2, flDPtr, IDP-Fusion, IUPred3.
- raw_data/caid2_train_and_test_data/CAID-2_Disorder_NOX_Testing_Sequences.fasta: Disorder-NOX dataset (used as the test set in this benchmark)
- raw_data/caid2_train_and_test_data/flDPnn_Training_Dataset.txt: training set for flDPnn
- raw_data/caid2_train_and_test_data/flDPnn_Validation_Dataset.txt: validation set for flDPnn
- raw_data/IDP-CRF_Training_Dataset.txt: training set for IDP-CRF
Processing disorder data
benchmarking/
└── caid/
    ├── processed_data/
    │   ├── caid2_competition_results/...
    │   ├── CAID-2_Disorder_NOX_Processed.csv
    │   ├── flDPnn_Training_Dataset.csv
    │   ├── flDPnn_Validation_Dataset.csv
    │   └── IDP-CRF_Training_Dataset.csv
    └── splits/
        ├── splits.csv
        ├── train_df.csv
        ├── test_df.csv
        └── fusion_bench_df.csv
clean.py processes and combines the raw data files, generating the following files in processed_data/:
- caid2_competition_results/: a folder with table versions of all the files in raw_data/caid2_competition_results/
- CAID-2_Disorder_NOX_Processed.csv: a table of test data, made by parsing raw_data/caid2_train_and_test_data/CAID-2_Disorder_NOX_Testing_Sequences.fasta
- flDPnn_Training_Dataset.csv: a table of flDPnn's training data, made by parsing raw_data/caid2_train_and_test_data/flDPnn_Training_Dataset.txt
- flDPnn_Validation_Dataset.csv: a table of flDPnn's validation data, made by parsing raw_data/caid2_train_and_test_data/flDPnn_Validation_Dataset.txt
- IDP-CRF_Training_Dataset.csv: a table of IDP-CRF's training data, made by parsing raw_data/caid2_train_and_test_data/CRF_Training_Dataset.txt
clean.py also generates the final train-test splits and the fusion oncoprotein benchmarking file used to train and evaluate the disorder predictors. These are stored in splits/:
- splits.csv: sequences, IDs, split (either "Train", "Test", or "Fusion_Benchmark"), and per-residue disorder labels based on AlphaFold-pLDDT (1 (disordered) if pLDDT < 68.8, 0 (ordered) if pLDDT >= 68.8)
- train_df.csv: just the Train set portion of splits.csv
- test_df.csv: just the Test set portion of splits.csv
- fusion_bench_df.csv: just the Fusion_Benchmark portion of splits.csv. Includes 524 fusion oncoproteins from the FusOn-pLM test set whose structures were collected from FusionPDB (see "Downloading and Processing FusionPDB data").
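The pLDDT-based labeling rule stated for splits.csv can be written in a few lines. This is an illustrative sketch, not the code in clean.py: the function names are hypothetical, and storing labels as a string of 0/1 characters is an assumption about the file layout.

```python
PLDDT_CUTOFF = 68.8  # threshold stated in the splits.csv description


def plddt_to_disorder_labels(plddts, cutoff=PLDDT_CUTOFF):
    """Label each residue 1 (disordered) if pLDDT < cutoff, else 0 (ordered)."""
    return [1 if p < cutoff else 0 for p in plddts]


def labels_to_string(labels):
    """Join labels into one string per sequence (assumed storage format)."""
    return "".join(str(x) for x in labels)


example_plddts = [91.2, 68.8, 68.79, 35.0]
print(labels_to_string(plddt_to_disorder_labels(example_plddts)))  # 0011
```

Note that a residue with pLDDT exactly equal to 68.8 is labeled ordered, since the rule uses a strict less-than comparison for disorder.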
Downloading and Processing FusionPDB data
The structures of fusion oncoproteins from the FusionPDB database were used to evaluate FusOn-pLM-Diso's performance on fusion oncoproteins. These data were collected by running scrape_fusionpdb.py, followed by process_fusion_structures.py. Together, these scripts populate both the raw_data and processed_data folders.
Listed below are all the relevant files:
benchmarking/
└── caid/
    ├── raw_data/
    │   └── fusionpdb/
    │       ├── structures/... # created by scrape_fusionpdb.py (folder not included in repo)
    │       ├── head_tail_af2db_structures/... # created by process_fusion_structures.py (folder not included in repo)
    │       ├── FusionPDB_level2_curated_09_05_2024.csv
    │       ├── FusionPDB_level2_fusion_structure_links.csv
    │       ├── FusionPDB_level3_curated_09_05_2024.csv
    │       ├── FusionPDB_level3_fusion_structure_links.csv
    │       ├── fusionpdb_structureless_ids.txt
    │       ├── hgene_tgene_uniprot_idmap_07_10_2024.txt
    │       ├── level2_head_tail_info.txt
    │       ├── level3_head_tail_info.txt
    │       └── not_in_afdb_idmap.txt
    └── processed_data/
        └── fusion_pdb/
            ├── intermediates/
            │   ├── giant_level_2-3_fusion_protein_head_tail_info.csv
            │   ├── giant_level2-3_fusion_protein_structure_links.csv
            │   ├── giant_level2-3_fusion_protein_structures_processed.csv
            │   ├── uniprotids_not_in_afdb.txt
            │   └── unmapped_parts.txt
            ├── fusion_heads_and_tails.csv
            ├── FusionPDB_level2-3_cleaned_FusionGID_info.csv
            ├── FusionPDB_level2-3_cleaned_structure_info.csv
            └── heads_tails_structural_data.csv
Pipeline
Here we describe what each script does and which files each script creates.
scrape_fusionpdb.py

i. Scrapes metadata for FusionPDB Level 2 and Level 3
   a. Pulls the online tables for Level 2 and Level 3, saving results to raw_data/FusionPDB_level2_curated_09_05_2024.csv and raw_data/FusionPDB_level3_curated_09_05_2024.csv, respectively.
ii. Retrieves structure links
   a. Using the tables collected in step (i), visits the page for each fusion oncoprotein (FO) in FusionPDB Level 2 and 3, and downloads all AlphaFold2 structure links for each FO.
   b. Saves results directly to raw_data/FusionPDB_level2_fusion_structure_links.csv and raw_data/FusionPDB_level3_fusion_structure_links.csv, respectively.
iii. Retrieves FO head gene and tail gene info
   a. Using the tables collected in step (i), visits the page for each FO in FusionPDB Level 2 and 3 to download head/tail info. Collects HGID and TGID (GeneIDs for head and tail) and UniProt accessions for each.
   b. Saves results directly to raw_data/level2_head_tail_info.txt and raw_data/level3_head_tail_info.txt, respectively.
iv. Combines Level 2 and 3 head/tail data
   a. Merges raw_data/level2_head_tail_info.txt and raw_data/level3_head_tail_info.txt into a dataframe.
   b. Saves result at processed_data/fusionpdb/fusion_heads_and_tails.csv (columns="FusionGID","HGID","TGID","HGUniProtAcc","TGUniProtAcc")
v. Combines Level 2 and 3 structure link data
   a. Joins structure link data with metadata for each of levels 2 and 3, then combines the result.
   b. Saves result at processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structure_links.csv
vi. Combines structure link data and metadata (result of step (v)) with head and tail data (result of step (iv)), and resolves any missing head/tail UniProt IDs.
   a. Merges the data.
   b. Checks how many rows have either missing or wrong UniProt accessions for the head or tail gene, and compiles the gene symbols for online querying in the UniProt ID Mapping tool (processed_data/fusionpdb/intermediates/unmapped_parts.txt).
   c. Reads the UniProt ID Mapping result. Combines this data with FusionPDB-scraped data by matching FusionPDB's HGID (GeneID for head) and TGID (GeneID for tail) with the GeneID returned by UniProt.
   d. For any FO where FusionPDB lacked a UniProt ID for the head/tail, this ID is filled in from the UniProt ID Mapping result.
   e. Saves result to processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_head_tail_info.csv. Columns: "FusionGID","FusionGene","Hgene","Tgene","URL","HGID","TGID","HGUniProtAcc","TGUniProtAcc","HGUniProtAcc_Source","TGUniProtAcc_Source", where the "_Source" columns indicate whether the UniProt ID came from FusionPDB or from the ID Map.
vii. Downloads AlphaFold2 structures of FOs from FusionPDB.
   a. Using structure links from processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structure_links.csv (step (v)), directly downloads .pdb and .cif files.
   b. Saves results in raw_data/fusionpdb/structures
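Step (vi) of scrape_fusionpdb.py, where missing head/tail UniProt accessions are filled in from the ID Mapping result, can be sketched as follows. This is a simplified illustration rather than the script's actual code: the function name, the shape of the GeneID-to-accession dictionary, and the strings written to the "_Source" columns are all assumptions, and it handles only missing accessions (not wrong ones).

```python
def fill_missing_accessions(rows, geneid_to_uniprot):
    """Fill empty head/tail UniProt accessions from a GeneID -> accession map.

    rows: list of dicts using the column names listed in step (vi).
    geneid_to_uniprot: dict built from the UniProt ID Mapping result
        (assumed shape: GeneID string -> UniProt accession string).
    """
    filled = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is left untouched
        for gid_col, acc_col in (("HGID", "HGUniProtAcc"), ("TGID", "TGUniProtAcc")):
            source_col = acc_col + "_Source"
            if row.get(acc_col):
                # Accession was already present in the scraped FusionPDB data.
                row[source_col] = "FusionPDB"
            elif row.get(gid_col) in geneid_to_uniprot:
                # Match on GeneID and take the accession from the ID map.
                row[acc_col] = geneid_to_uniprot[row[gid_col]]
                row[source_col] = "UniProt ID Map"
            else:
                # Still unresolved after consulting the ID map.
                row[source_col] = None
        filled.append(row)
    return filled
```

Recording the source of each accession alongside the value keeps the provenance visible downstream, which is the purpose of the "_Source" columns described above.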
process_fusion_structures.py

i. Determines pLDDT(s) for each FO structure.
   a. For each structure in raw_data/fusionpdb_structures/, determines amino acid sequence, per-residue pLDDT, and average pLDDT from the AlphaFold2 structure.
   b. Saves results in processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structures_processed.csv.
ii. Downloads AlphaFold2 structures for all head and tail proteins
   a. Reads processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_head_tail_info.csv and collects all unique UniProt IDs for all head/tail proteins.
   b. For each UniProt ID, queries the AlphaFoldDB, downloads the AlphaFold2 structure (if available), and saves it to raw_data/fusionpdb/head_tail_af2db_structures/. Saves files converted from PDB to CIF format in mmcif_converted_files. Then, extracts the sequence, per-residue pLDDT, and average pLDDT from the file.
   c. Saves any UniProt IDs that did not have structures in the AlphaFoldDB to processed_data/fusionpdb/intermediates/uniprotids_not_in_afdb.txt. Most of these were very long, but the shorter ones were folded and their average pLDDTs were entered manually. These were put back into the AlphaFold ID map to look for alternative UniProt IDs, and their results are in not_in_afdb_idmap.txt.
   d. Saves results to processed_data/fusionpdb/heads_tails_structural_data.csv
iii. Cleans the database of level 2&3 structural info
   a. Drops rows where no structure was successfully downloaded
   b. Drops rows where the FO sequence from FusionPDB does not match the FO sequence from its own AlphaFold2 structure file
   c. Saves two final, cleaned databases:
      a. FusionPDB_level2-3_cleaned_FusionGID_info.csv: includes full IDs and structural information for the Hgene and Tgene of each FO. Columns = "FusionGID", "FusionGene", "Hgene", "Tgene", "URL", "HGID", "TGID", "HGUniProtAcc", "TGUniProtAcc", "HGUniProtAcc_Source", "TGUniProtAcc_Source", "HG_pLDDT", "HG_AA_pLDDTs", "HG_Seq", "TG_pLDDT", "TG_AA_pLDDTs", "TG_Seq".
      b. FusionPDB_level2-3_cleaned_structure_info.csv: includes full structural information for each FO. Columns = "FusionGID", "FusionGene", "Fusion_Seq", "Fusion_Length", "Hgene", "Hchr", "Hbp", "Hstrand", "Tgene", "Tchr", "Tbp", "Tstrand", "Level", "Fusion_Structure_Link", "Fusion_Structure_Type", "Fusion_pLDDT", "Fusion_AA_pLDDTs", "Fusion_Seq_Source"
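Extracting the sequence, per-residue pLDDT, and average pLDDT from an AlphaFold2 structure (steps (i) and (ii) above) relies on the fact that AlphaFold2 PDB files store pLDDT in the B-factor column. The sketch below parses PDB text with the standard library only. It is not the code in process_fusion_structures.py: the function name is hypothetical, and it assumes a single-chain, single-model file and reads one CA atom per residue.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def parse_plddt_from_pdb(pdb_text):
    """Return (sequence, per-residue pLDDTs, average pLDDT) from PDB text.

    Uses fixed PDB columns: atom name in columns 13-16, residue name in
    columns 18-20, and B-factor (pLDDT for AlphaFold2 models) in columns 61-66.
    """
    seq, plddts = [], []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            seq.append(THREE_TO_ONE.get(line[17:20], "X"))  # X = unknown residue
            plddts.append(float(line[60:66]))
    average = sum(plddts) / len(plddts) if plddts else float("nan")
    return "".join(seq), plddts, average
```

For .cif files, or for structures with multiple chains, a structure-parsing library would be a more robust choice than fixed-column slicing.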
Training
The model is defined in model.py and utils.py. Training configs can be provided in config.py:
# Which models to benchmark
BENCHMARK_FUSONPLM = True
# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
# If you want to use the trained FusOn-pLM, set FUSONPLM_CKPTS="FusOn-pLM" instead
FUSONPLM_CKPTS= "FusOn-pLM"
BENCHMARK_ESM = True
# GPU configs
CUDA_VISIBLE_DEVICES="0"
# Overwriting configs
PERMISSION_TO_OVERWRITE_EMBEDDINGS = False # if False, script will halt if it believes these embeddings have already been made.
PERMISSION_TO_OVERWRITE_MODELS = False # if False, script will halt if it believes these models have already been trained.
train.py trains the models using the embeddings indicated in config.py. It also performs a hyperparameter screen.
- All results are stored in caid/results/<timestamp>, where timestamp is a unique string encoding the date and time when you started training.
- All raw outputs from models are stored in caid/trained_models/<embedding_path>, where embedding_path represents the embeddings used to build the disorder predictor.
- All embeddings made for training will be stored in a new folder called caid/embeddings/ with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
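The overwrite flags in config.py act as a guard against regenerating saved embeddings or models by accident. A minimal sketch of such a guard is shown here. It only illustrates the idea: the function name check_can_write is hypothetical, and the real check in train.py may decide that outputs "have already been made" in a different way.

```python
from pathlib import Path

PERMISSION_TO_OVERWRITE_EMBEDDINGS = False  # mirrors the flag in config.py


def check_can_write(path, permission_to_overwrite):
    """Return the path if it is safe to write; raise if it exists and overwriting is off."""
    path = Path(path)
    if path.exists() and not permission_to_overwrite:
        raise FileExistsError(
            f"{path} already exists; set the overwrite flag to True to regenerate it."
        )
    return path
```

Halting early, before any expensive embedding or training step starts, is what makes the guard useful for long runs launched with nohup.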
Below is the FusOn-pLM-Diso raw outputs folder, trained_models/fuson_plm/best/ (ESM-2-650M-Diso has a folder in the same format, and future trained models will as well):
benchmarking/
└── caid/
    └── trained_models/
        ├── esm2_t33_650M_UR50D/best/
        └── fuson_plm/best/
            ├── caid_hyperparam_screen_fusion_benchmark_metrics.csv
            ├── caid_hyperparam_screen_fusion_benchmark_probs.csv
            ├── caid_hyperparam_screen_test_metrics.csv
            ├── caid_hyperparam_screen_test_probs.csv
            ├── caid_train_losses.csv
            └── params.txt
- caid_hyperparam_screen_fusion_benchmark_metrics.csv: performance metrics (Accuracy, Precision, Recall, F1 Score, AUROC) for the top model on the fusion benchmark set (splits/fusion_bench_df.csv)
- caid_hyperparam_screen_fusion_benchmark_probs.csv: for the fusion benchmark, the raw probabilities of class 1 (disorder), the threshold used to assign 0/1 (chosen to maximize F1 score), and the prediction labels derived from the probabilities and threshold
- caid_hyperparam_screen_test_metrics.csv: same as caid_hyperparam_screen_fusion_benchmark_metrics.csv, but for CAID2 Disorder-NOX (splits/test_df.csv)
- caid_hyperparam_screen_test_probs.csv: same as caid_hyperparam_screen_fusion_benchmark_probs.csv, but for CAID2 Disorder-NOX
- caid_train_losses.csv: train losses over the 2 training epochs for the top-performing model
- params.txt: hyperparameters of the top-performing model
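The probability files above record a threshold chosen to maximize F1 score. One straightforward way to pick such a threshold is to try every observed probability as a candidate, as in this sketch. It is an illustration under assumptions, not the method in train.py: the function names are hypothetical, and treating a probability greater than or equal to the threshold as class 1 is an assumed convention.

```python
def f1_score(labels, preds):
    """F1 for binary labels, where 1 is the positive (disordered) class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(labels, probs):
    """Return (threshold, F1) maximizing F1 over the observed probabilities."""
    if not probs:
        raise ValueError("probs must not be empty")
    best_t, best_f1 = None, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, score
    return best_t, best_f1


labels = [0, 0, 1, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.9]
threshold, f1 = best_f1_threshold(labels, probs)
print(threshold)  # 0.35
```

In this toy example the threshold 0.35 gives 3 true positives and 1 false positive, for an F1 of 6/7.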
Results from the FusOn-pLM manuscript are found in results/final. A few extra data files and plots are added by analyze_fusion_preds.py.
benchmarking/
└── caid/
    └── results/final
        ├── best_caid_model_results.csv
        ├── caid_hyperparam_screen_test_metrics.csv
        ├── caid_hyperparam_screen_fusion_benchmark_metrics.csv
        ├── caid_hyperparam_screen_train_losses.csv
        ├── fusion_disorder_boxplots.png
        ├── fusion_pred_disorder_r2.png
        ├── fusion_disorder_boxplots_source_data.csv
        ├── fusion_pred_disorder_r2_source_data.csv
        ├── CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png
        ├── CAID_fpr_tpr_source_data.csv
        └── CAID_prediction_source_data.csv
- best_caid_model_results.csv: Summary file of hyperparameters, test set statistics, and fusion benchmark statistics for the best model of each type screened (ESM-2-650M, FusOn-pLM)
- caid_hyperparam_screen_fusion_benchmark_metrics.csv: Fusion benchmark set statistics for full hyperparameter screen
- caid_hyperparam_screen_test_metrics.csv: Test set statistics for full hyperparameter screen
- caid_hyperparam_screen_train_losses.csv: Train losses for full hyperparameter screen
- fusion_disorder_boxplots.png: Fig. 4E, left (data directly used to produce the plot at fusion_disorder_boxplots_source_data.csv)
- fusion_pred_disorder_r2.png: Fig. 4E, right (data directly used to produce the plot at fusion_pred_disorder_r2_source_data.csv)
- CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png: Fig. 4D (probabilities used at CAID_prediction_source_data.csv, FPR/TPR relationships directly used to make the plot at CAID_fpr_tpr_source_data.csv)
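The FPR/TPR source data and the AUROC values reported above follow from per-residue probabilities and binary labels. The sketch below shows how ROC points and a trapezoidal AUROC can be computed with the standard library. It is a generic illustration, not the code in plot.py, and the function names are hypothetical.

```python
def roc_points(labels, probs):
    """Return (fpr, tpr) lists, sweeping the decision threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    # Rank residues by descending predicted disorder probability.
    ranked = sorted(zip(probs, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume all residues tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auroc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auroc(fpr, tpr))  # 0.75
```

Grouping tied scores into a single ROC point avoids overstating the area when many residues share the same predicted probability.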
To run the training script, use
nohup python train.py > train.out 2> train.err &
Plotting
The plot.py script generates many figures from the paper, alongside the formatted data directly used for plotting.
benchmarking/
└── caid/
    ├── results/final/
    │   └── CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png
    └── processed_data/
        └── figures/
            ├── fusion_disorder/
            │   ├── plddt_sequence_EML4-ALK.png
            │   ├── plddt_sequence_EML4::ALK_source_data.csv
            │   ├── plddt_sequence_EWSR1-FLI1.png
            │   ├── plddt_sequence_EWSR1::FLI1_source_data.csv
            │   ├── plddt_sequence_PAX3-FOXO1.png
            │   ├── plddt_sequence_PAX3::FOXO1_source_data.csv
            │   ├── plddt_sequence_SS18-SSX1.png
            │   └── plddt_sequence_SS18::SSX1_source_data.csv
            └── histograms/
                ├── disorder_nox_histogram.png
                ├── disorder_nox_histogram_source_data.csv
                ├── fusions_histogram.png
                ├── fusions_histogram_source_data.csv
                ├── heads_histogram.png
                ├── heads_histogram_source_data.csv
                ├── tails_histogram.png
                └── tails_histogram_source_data.csv
- Plots in fusion_disorder are from Fig. 1C
- Plots in histograms are from Fig. 1D and Fig. S1
To regenerate these plots and source data, run:
python plot.py
Colored structure images
color_disorder_residues.ipynb is used to plot fusion structures with pLDDT or disorder prediction color overlays. By running certain (or all) of its cells, you will recreate images from Fig. 1C and 4F, as well as the following file:
benchmarking/
└── caid/
    └── disorder_coloring_data
        └── normalized_disorder_propensities_source_data.csv
- normalized_disorder_propensities_source_data.csv: the normalized disorder propensities that were visualized on fusion structures in Fig. 4F