	## CAID Benchmark

	This folder contains all the data and code needed to perform the CAID benchmark, where FusOn-pLM-Diso (a classifier built on FusOn-pLM embeddings) is used to predict per-residue disorder propensities (Figure 4C-F) and to plot disorder properties (Figure 1C-1D, S1).

	### TL;DR
	The order in which to run the scripts:

	```
	python scrape_fusionpdb.py # pull FusionPDB structures
	python process_fusion_structures.py # process FusionPDB structures, and head/tail protein structures
	python clean.py # clean disorder data and structure data. Assemble train/test/benchmark splits
	python train.py # train models
	python analyze_fusion_preds.py # make box chart and line plot of model performance on fusion proteins
	python plot.py # plot AUROC of model performance, and additional figures based on disorder data
	```
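
If you prefer to launch the whole sequence from one place, the order above can be wrapped in a small driver. This is only a sketch: `run_pipeline` and its injectable `runner` argument are hypothetical helpers, not part of the repository, and it assumes the scripts are run from the `caid/` folder.

```python
import subprocess
import sys

# Script order copied from the TL;DR list above.
PIPELINE = [
    "scrape_fusionpdb.py",
    "process_fusion_structures.py",
    "clean.py",
    "train.py",
    "analyze_fusion_preds.py",
    "plot.py",
]


def _run(cmd):
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd).returncode


def run_pipeline(scripts=PIPELINE, runner=_run):
    """Run each script in order and stop at the first non-zero exit code.

    `runner` takes the command list and returns an exit code; swapping it
    out allows a dry run. Returns the scripts that finished successfully.
    """
    completed = []
    for script in scripts:
        code = runner([sys.executable, script])
        if code != 0:
            print(f"{script} exited with code {code}; stopping here.")
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` runs everything; `run_pipeline(runner=lambda cmd: 0)` performs a dry run that only returns the planned order.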

	Additional notes:

	* `color_disorder_residues.ipynb` is used to plot fusion structures with pLDDT or disorder prediction color overlays.
	* We recommend using `nohup` to run longer scripts like `scrape_fusionpdb.py`, `process_fusion_structures.py`, `clean.py`, and `train.py`.

	### Downloading raw disorder data
	Per-residue disorder predictions were used to train and test FusOn-pLM-Diso.

	1. flDPnn ([Hu et al. 2021](https://doi.org/10.1038/s41467-021-24773-7))
	   1. At this [link](http://biomine.cs.vcu.edu/servers/flDPnn/?fbclid=IwZXh0bgNhZW0CMTEAAR0KO5CkNdkGC9e5O32S0QoG3BWOw6_egbnioXQNBSv3UC-m_b_dxh70Nnk_aem_z285WFCHdBLw3vOj7LL37A), scroll down to the bottom to find links to the [training](http://biomine.cs.vcu.edu/servers/flDPnn/data/flDPnn_Training_Annotation.txt) and [validation](http://biomine.cs.vcu.edu/servers/flDPnn/data/flDPnn_Validation_Annotation.txt) sets.
	2. IDP-CRF ([Liu et al. 2018](https://doi.org/10.3390/ijms19092483))
	   1. Download zipped data from [this link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6164615/bin/ijms-19-02483-s001.zip), remove header and footer, and save as a FASTA file.
	3. CAID2-Disorder-NOX ([Del Conte et al. 2023](https://doi.org/10.1002/prot.26582))
	   1. Go to [CAID Round 2 Results](https://caid.idpcentral.org/challenge/results?fbclid=IwZXh0bgNhZW0CMTEAAR12dKaA0KywcT71FnyXIrrNS91pwGREsLiq5c2RmfdYl7L0VdUNG7jYai8_aem_tW6Wm9_11ZuiI_GKzbNZjA). Scroll to "Here you can download the references used in the CAID-2 challenge" and you'll find the following links.
	      1. [disorder_nox.fasta](https://caid.idpcentral.org/assets/sections/challenge/static/references/2/disorder_nox.fasta)
	      2. [predictions](https://caid.idpcentral.org/assets/sections/challenge/static/predictions/2/predictions.zip) made by all CAID2 participants; AUROC curves can be reconstructed from these

	Raw disorder data are stored in `caid/raw_data`:
	```
	benchmarking/
	└── caid/
	    └── raw_data/
	        └── caid2_competition_results/...
	        └── caid2_train_and_test_data/
	            ├── CAID-2_Disorder_NOX_Testing_Sequences.fasta
	            ├── flDPnn_Training_Dataset.txt
	            ├── flDPnn_Validation_Annotation.txt
	            ├── IDP-CRF_Training_Dataset.txt
	```
	- 📁 `raw_data/caid2_competition_results/`: folder containing raw predictions from CAID2 competitors, downloaded directly from the CAID2 website. Models: AlphaFold-disorder, AlphaFold-rsa, DeepIDP-2L, disomine, DisoPred, DISOPRED3-diso, Dispredict3, ESpritz-D, flDPlr2, flDPnn, flDPnn2, flDPtr, IDP-Fusion, IUPred3.
	- `raw_data/caid2_train_and_test_data/CAID-2_Disorder_NOX_Testing_Sequences.fasta`: Disorder-NOX dataset (used as the test set in this benchmark)
	- `raw_data/caid2_train_and_test_data/flDPnn_Training_Dataset.txt`: training set for flDPnn
	- `raw_data/caid2_train_and_test_data/flDPnn_Validation_Dataset.txt`: validation set for flDPnn
	- `raw_data/caid2_train_and_test_data/IDP-CRF_Training_Dataset.txt`: training set for IDP-CRF

	### Processing disorder data

	```
	benchmarking/
	└── caid/
	    └── processed_data/
	        └── caid2_competition_results/...
	        ├── CAID-2_Disorder_NOX_Processed.csv
	        ├── flDPnn_Training_Dataset.csv
	        ├── flDPnn_Validation_Dataset.csv
	        ├── IDP-CRF_Training_Dataset.csv
	    └── splits/
	        ├── splits.csv
	        ├── train_df.csv
	        ├── test_df.csv
	        ├── fusion_bench_df.csv
	```

	`clean.py` processes and combines the raw data files, generating the following files in 📁`processed_data/`:
	- 📁 `caid2_competition_results/`: a folder with table versions of all the files in 📁 `raw_data/caid2_competition_results/`
	- `CAID-2_Disorder_NOX_Processed.csv`: a table of test data, made by parsing `raw_data/caid2_train_and_test_data/CAID-2_Disorder_NOX_Testing_Sequences.fasta`
	- `flDPnn_Training_Dataset.csv`: a table of flDPnn's training data, made by parsing `raw_data/caid2_train_and_test_data/flDPnn_Training_Dataset.txt`
	- `flDPnn_Validation_Dataset.csv`: a table of flDPnn's validation data, made by parsing `raw_data/caid2_train_and_test_data/flDPnn_Validation_Dataset.txt`
	- `IDP-CRF_Training_Dataset.csv`: a table of IDP-CRF's training data, made by parsing `raw_data/caid2_train_and_test_data/IDP-CRF_Training_Dataset.txt`
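
To illustrate the kind of parsing described above, here is a minimal sketch for annotation files laid out as three-line records: a `>` header, the sequence, and a per-residue label string. That layout is an assumption about the raw files (check them before relying on it), and `parse_labeled_fasta` is a hypothetical helper rather than the function used in `clean.py`.

```python
def parse_labeled_fasta(text):
    """Parse records of the form: '>ID' line, sequence line, label line.

    Returns a list of dicts with keys 'id', 'sequence', and 'labels'.
    The three-line layout is an assumption about the raw files.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected header/sequence/label triplets")

    records = []
    for i in range(0, len(lines), 3):
        header, sequence, labels = lines[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"record {i // 3} does not start with '>'")
        if len(sequence) != len(labels):
            raise ValueError(f"{header}: sequence and labels differ in length")
        records.append({"id": header[1:], "sequence": sequence, "labels": labels})
    return records
```

Each returned dict corresponds to one table row, so the list can be passed to `csv.DictWriter` or `pandas.DataFrame` to produce a CSV.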

	`clean.py` also generates the final train-test splits and fusion oncoprotein benchmarking file used to train and evaluate the disorder predictors. These are stored in 📁`splits/`:
	- `splits.csv`: sequences, IDs, split (either "Train", "Test", or "Fusion_Benchmark"), and per-residue disorder labels based on AlphaFold-pLDDT (1 (disordered) if pLDDT < 68.8, 0 (ordered) if pLDDT >= 68.8)
	- `train_df.csv`: just the Train set portion of `splits.csv`
	- `test_df.csv`: just the Test set portion of `splits.csv`
	- `fusion_bench_df.csv`: just the Fusion_Benchmark portion of `splits.csv`. Includes 524 fusion oncoproteins from the FusOn-pLM test set whose structures were collected from FusionPDB (see "Downloading and Processing FusionPDB data").
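
The pLDDT cutoff above translates directly into a labeling rule. The sketch below applies it to a list of per-residue pLDDT values; `disorder_labels` is a hypothetical helper, and returning the labels as a string of `0`/`1` characters is an assumption about how they might be stored.

```python
PLDDT_CUTOFF = 68.8  # threshold stated above


def disorder_labels(plddts, cutoff=PLDDT_CUTOFF):
    """Label each residue '1' (disordered) if pLDDT < cutoff, else '0' (ordered)."""
    return "".join("1" if p < cutoff else "0" for p in plddts)
```

A residue with a pLDDT of exactly 68.8 is labeled ordered, matching the `>=` in the rule.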


	### Downloading and Processing FusionPDB data
	The structures of fusion oncoproteins from the FusionPDB database were used to evaluate FusOn-pLM-Diso's performance on fusion oncoproteins. This data was collected by running `scrape_fusionpdb.py`, followed by `process_fusion_structures.py`. These scripts populated the `raw_data` and `processed_data` folders simultaneously.

	Listed below are all the relevant files:

	```
	benchmarking/
	└── caid/
	    └── raw_data/
	        └── fusionpdb/
	            └── structures/... # created by scrape_fusionpdb.py (folder not included in repo)
	            └── head_tail_af2db_structures/... # created by process_fusion_structures.py (folder not included in repo)
	            ├── FusionPDB_level2_curated_09_05_2024.csv
	            ├── FusionPDB_level2_fusion_structure_links.csv
	            ├── FusionPDB_level3_curated_09_05_2024.csv
	            ├── FusionPDB_level3_fusion_structure_links.csv
	            ├── fusionpdb_structureless_ids.txt
	            ├── hgene_tgene_uniprot_idmap_07_10_2024.txt
	            ├── level2_head_tail_info.txt
	            ├── level3_head_tail_info.txt
	            ├── not_in_afdb_idmap.txt
	    └── processed_data/
	        └── fusionpdb/
	            └── intermediates/
	                ├── giant_level2-3_fusion_protein_head_tail_info.csv
	                ├── giant_level2-3_fusion_protein_structure_links.csv
	                ├── giant_level2-3_fusion_protein_structures_processed.csv
	                ├── uniprotids_not_in_afdb.txt
	                ├── unmapped_parts.txt
	            ├── fusion_heads_and_tails.csv
	            ├── FusionPDB_level2-3_cleaned_FusionGID_info.csv
	            ├── FusionPDB_level2-3_cleaned_structure_info.csv
	            ├── heads_tails_structural_data.csv
	```

	#### ⚙️ Pipeline

	Here we describe what each script does and which files each script creates.
	1. 🐍 `scrape_fusionpdb.py`
	    i. Scrapes metadata for FusionPDB Level 2 and Level 3
	        a. Pulls the online tables for [Level 2](https://compbio.uth.edu/FusionPDB/gene_search_result_0.cgi?type=chooseLevel&chooseLevel=level2) and [Level 3](https://compbio.uth.edu/FusionPDB/gene_search_result_0.cgi?type=chooseLevel&chooseLevel=level3), saving results to `raw_data/FusionPDB_level2_curated_09_05_2024.csv` and `raw_data/FusionPDB_level3_curated_09_05_2024.csv`, respectively.
	    ii. Retrieves structure links
	        a. Using the tables collected in step (i), visits the page for each fusion oncoprotein (FO) in FusionPDB Level 2 and 3, and downloads all AlphaFold2 structure links for each FO.
	        b. Saves results directly to `raw_data/FusionPDB_level2_fusion_structure_links.csv` and `raw_data/FusionPDB_level3_fusion_structure_links.csv`, respectively.
	    iii. Retrieves FO head gene and tail gene info
	        a. Using the tables collected in step (i), visits the page for each FO in FusionPDB Level 2 and 3 to download head/tail info. Collects HGID and TGID (GeneIDs for head and tail) and UniProt accessions for each.
	        b. Saves results directly to `raw_data/level2_head_tail_info.txt` and `raw_data/level3_head_tail_info.txt`, respectively.
	    iv. Combines Level 2 and 3 head/tail data
	        a. Merges `raw_data/level2_head_tail_info.txt` and `raw_data/level3_head_tail_info.txt` into a dataframe.
	        b. Saves result at `processed_data/fusionpdb/fusion_heads_and_tails.csv` (columns="FusionGID","HGID","TGID","HGUniProtAcc","TGUniProtAcc")
	    v. Combines Level 2 and 3 structure link data
	        a. Joins structure link data with metadata for each of levels 2 and 3, then combines the result.
	        b. Saves result at `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structure_links.csv`
	    vi. Combines structure link data and metadata (result of step (v)) with head and tail data (result of step (iv)), and resolves any missing head/tail UniProt IDs.
	        a. Merges the data.
	        b. Checks how many rows have either missing or wrong UniProt accessions for the head or tail gene, and compiles the gene symbols for online querying in the UniProt ID Mapping tool (`processed_data/fusionpdb/intermediates/unmapped_parts.txt`)
	        c. Reads the UniProt ID Mapping result. Combines this data with FusionPDB-scraped data by matching FusionPDB's HGID (GeneID for head) and TGID (GeneID for tail) with the GeneID returned by UniProt.
	        d. For any FO where FusionPDB lacked a UniProt ID for the head/tail, this ID is filled in from the UniProt ID Mapping result.
	        e. Saves result to `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_head_tail_info.csv`. Columns: "FusionGID","FusionGene","Hgene","Tgene","URL","HGID","TGID","HGUniProtAcc","TGUniProtAcc","HGUniProtAcc_Source","TGUniProtAcc_Source", where the "_Source" columns indicate whether the UniProt ID came from FusionPDB or from the ID Map.
	    vii. Downloads AlphaFold2 structures of FOs from FusionPDB.
	        a. Using structure links from `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structure_links.csv` (step (v)), directly downloads `.pdb` and `.cif` files.
	        b. Saves results in 📁`raw_data/fusionpdb/structures`
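
The fill-in logic of step (vi) can be pictured with plain dictionaries. This is a sketch, not the script's actual code: `fill_accession` and `fill_row` are hypothetical helpers, the source strings `"FusionPDB"` and `"UniProt ID Map"` are placeholder values for the "_Source" columns, and the accessions in the usage note are made up.

```python
def fill_accession(scraped_acc, gene_id, idmap):
    """Return (accession, source) for one head or tail gene.

    scraped_acc: accession scraped from FusionPDB, or None/"" if missing.
    gene_id:     the HGID or TGID reported by FusionPDB.
    idmap:       dict mapping GeneID -> UniProt accession (ID Mapping result).
    """
    if scraped_acc:
        return scraped_acc, "FusionPDB"
    mapped = idmap.get(gene_id)
    if mapped:
        return mapped, "UniProt ID Map"
    return None, None


def fill_row(row, idmap):
    """Fill the head (HG) and tail (TG) accession and source columns of one row."""
    filled = dict(row)
    for prefix, id_col in (("HG", "HGID"), ("TG", "TGID")):
        acc, source = fill_accession(row.get(f"{prefix}UniProtAcc"), row[id_col], idmap)
        filled[f"{prefix}UniProtAcc"] = acc
        filled[f"{prefix}UniProtAcc_Source"] = source
    return filled
```

For example, `fill_row({"HGID": "111", "TGID": "222", "HGUniProtAcc": "P00001", "TGUniProtAcc": None}, {"222": "P00002"})` keeps the scraped head accession and fills the tail accession from the map.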


	2. 🐍 `process_fusion_structures.py`
	    i. Determines pLDDT(s) for each FO structure.
	        a. For each structure in 📁`raw_data/fusionpdb/structures/`, determines amino acid sequence, per-residue pLDDT, and average pLDDT from the AlphaFold2 structure.
	        b. Saves results in `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structures_processed.csv`.
	    ii. Downloads AlphaFold2 structures for all head and tail proteins
	        a. Reads `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_head_tail_info.csv` and collects all unique UniProt IDs for all head/tail proteins.
	        b. For each UniProt ID, queries the AlphaFoldDB, downloads the AlphaFold2 structure (if available), and saves it to 📁`raw_data/fusionpdb/head_tail_af2db_structures/`. Saves files converted from PDB to CIF format in `mmcif_converted_files`. Then, extracts the sequence, per-residue pLDDT, and average pLDDT from the file.
	        c. Saves any UniProt IDs that did not have structures in the AlphaFoldDB to `processed_data/fusionpdb/intermediates/uniprotids_not_in_afdb.txt`. Most of these were very long, but the shorter ones were folded and their average pLDDTs were entered manually. These were put back into the AlphaFold ID map to look for alternative UniProt IDs, and their results are in `not_in_afdb_idmap.txt`.
	        d. Saves results to `processed_data/fusionpdb/heads_tails_structural_data.csv`
	    iii. Cleans the database of level 2&3 structural info
	        a. Drops rows where no structure was successfully downloaded
	        b. Drops rows where the FO sequence from FusionPDB does not match the FO sequence from its own AlphaFold2 structure file
	        c. ⭐️Saves two final, cleaned databases⭐️:
	            a. ⭐️ `FusionPDB_level2-3_cleaned_FusionGID_info.csv`: includes full IDs and structural information for the Hgene and Tgene of each FO. Columns = "FusionGID", "FusionGene", "Hgene", "Tgene", "URL", "HGID", "TGID", "HGUniProtAcc", "TGUniProtAcc", "HGUniProtAcc_Source", "TGUniProtAcc_Source", "HG_pLDDT", "HG_AA_pLDDTs", "HG_Seq", "TG_pLDDT", "TG_AA_pLDDTs", "TG_Seq".
	            b. ⭐️ `FusionPDB_level2-3_cleaned_structure_info.csv`: includes full structural information for each FO. Columns = "FusionGID", "FusionGene", "Fusion_Seq", "Fusion_Length", "Hgene", "Hchr", "Hbp", "Hstrand", "Tgene", "Tchr", "Tbp", "Tstrand", "Level", "Fusion_Structure_Link", "Fusion_Structure_Type", "Fusion_pLDDT", "Fusion_AA_pLDDTs", "Fusion_Seq_Source"
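
As an illustration of step (i), AlphaFold2 `.pdb` files store per-residue pLDDT in the B-factor column, so the sequence and pLDDTs can be read from the alpha-carbon `ATOM` records. The sketch below uses only fixed-column parsing of the PDB format; `read_plddt` is a hypothetical helper, and the actual script may use a structure-parsing library instead.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def read_plddt(pdb_text):
    """Return (sequence, per-residue pLDDTs, average pLDDT) from PDB text.

    Only alpha-carbon ATOM records are used, giving one value per residue.
    PDB columns (1-indexed): atom name 13-16, residue name 18-20,
    B-factor 61-66 (pLDDT in AlphaFold2 models).
    """
    sequence, plddts = [], []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            sequence.append(THREE_TO_ONE.get(line[17:20], "X"))  # X = unknown residue
            plddts.append(float(line[60:66]))
    average = sum(plddts) / len(plddts) if plddts else float("nan")
    return "".join(sequence), plddts, average
```

Typical usage would be `seq, plddts, avg = read_plddt(open(path).read())` for each downloaded structure, where `path` points to a `.pdb` file.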


	### Training

	The model is defined in `model.py` and `utils.py`. Training configs can be provided in `config.py`:

	```
	# Which models to benchmark
	BENCHMARK_FUSONPLM = True
	# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
	# To use the trained FusOn-pLM instead, set FUSONPLM_CKPTS="FusOn-pLM"
	FUSONPLM_CKPTS= "FusOn-pLM"

	BENCHMARK_ESM = True

	# GPU configs
	CUDA_VISIBLE_DEVICES="0"

	# Overwriting configs
	PERMISSION_TO_OVERWRITE_EMBEDDINGS = False # if False, script will halt if it believes these embeddings have already been made.
	PERMISSION_TO_OVERWRITE_MODELS = False # if False, script will halt if it believes these models have already been trained.
	```
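
The overwrite flags act as a guard against regenerating existing outputs. A minimal sketch of such a guard, assuming the check is simply whether the output path already exists (`may_write` is a hypothetical helper; the real scripts may decide differently):

```python
from pathlib import Path


def may_write(path, permission_to_overwrite):
    """Return True if it is safe to (re)generate the output at `path`.

    Writing is allowed when the path does not exist yet, or when the
    matching PERMISSION_TO_OVERWRITE_* flag is True.
    """
    return permission_to_overwrite or not Path(path).exists()
```

A caller would halt when the guard fails, for example `if not may_write(out_dir, PERMISSION_TO_OVERWRITE_EMBEDDINGS): sys.exit("embeddings already exist")`, where `out_dir` is whatever folder the embeddings are written to.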

	`train.py` trains the models using embeddings indicated in `config.py`. It also performs a hyperparameter screen.
	- All results are stored in `caid/results/<timestamp>`, where `timestamp` is a unique string encoding the date and time when you started training.
	- All raw outputs from models are stored in `caid/trained_models/<embedding_path>`, where `embedding_path` represents the embeddings used to build the disorder predictor.
	- All embeddings made for training will be stored in a new folder called `caid/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.

	Below is the FusOn-pLM-Diso raw outputs folder, `trained_models/fuson_plm/best/` (ESM-2-650M-Diso has a folder in the same format, as will future trained models):

	```
	benchmarking/
	└── caid/
	    └── trained_models/
	        └── esm2_t33_650M_UR50D/best/
	        └── fuson_plm/best/
	            ├── caid_hyperparam_screen_fusion_benchmark_metrics.csv
	            ├── caid_hyperparam_screen_fusion_benchmark_probs.csv
	            ├── caid_hyperparam_screen_test_metrics.csv
	            ├── caid_hyperparam_screen_test_probs.csv
	            ├── caid_train_losses.csv
	            ├── params.txt
	```

	- `caid_hyperparam_screen_fusion_benchmark_metrics.csv`: performance metrics (Accuracy, Precision, Recall, F1 Score, AUROC) for the top model on the fusion benchmark set (`splits/fusion_bench_df.csv`)
	- `caid_hyperparam_screen_fusion_benchmark_probs.csv`: for the fusion benchmark, raw probabilities of class 1 (disorder), threshold used to assign 0/1 based on maximized F1 score, prediction labels based on probabilities and threshold
	- `caid_hyperparam_screen_test_metrics.csv`: same as `caid_hyperparam_screen_fusion_benchmark_metrics.csv`, but for CAID2 Disorder-NOX (`splits/test_df.csv`)
	- `caid_hyperparam_screen_test_probs.csv`: same as `caid_hyperparam_screen_fusion_benchmark_probs.csv`, but for CAID2 Disorder-NOX
	- `caid_train_losses.csv`: train losses over the 2 training epochs for top-performing model
	- `params.txt`: hyperparameters of top performing model
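
The `*_probs.csv` files pair raw probabilities with a threshold chosen to maximize F1. One simple way to pick such a threshold is to try every distinct probability as a cutoff and keep the best. This is a sketch of that idea with hypothetical helper names, not necessarily the search used in `train.py`.

```python
def f1_score(labels, preds):
    """F1 for binary labels/predictions (1 = disordered). Returns 0.0 if undefined."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(labels, probs):
    """Return (threshold, F1) maximizing F1, predicting 1 when prob >= threshold.

    Ties are resolved in favor of the lowest threshold.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1
```

For labels `[0, 0, 1, 1]` and probabilities `[0.1, 0.4, 0.35, 0.8]`, the best cutoff is 0.35 with an F1 of 0.8.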

	Results from the FusOn-pLM manuscript are found in `results/final`. A few extra data files and plots are added by `analyze_fusion_preds.py`.

	```
	benchmarking/
	└── caid/
	    └── results/final
	        ├── best_caid_model_results.csv
	        ├── caid_hyperparam_screen_test_metrics.csv
	        ├── caid_hyperparam_screen_fusion_benchmark_metrics.csv
	        ├── caid_hyperparam_screen_train_losses.csv
	        ├── fusion_disorder_boxplots.png
	        ├── fusion_pred_disorder_r2.png
	        ├── fusion_disorder_boxplots_source_data.csv
	        ├── fusion_pred_disorder_r2_source_data.csv
	        ├── CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png
	        ├── CAID_fpr_tpr_source_data.csv
	        ├── CAID_prediction_source_data.csv
	```

	- `best_caid_model_results.csv`: Summary file of hyperparameters, test set statistics, and fusion benchmark statistics for the best model of each type screened (ESM-2-650M, FusOn-pLM)
	- `caid_hyperparam_screen_fusion_benchmark_metrics.csv`: Fusion benchmark set statistics for full hyperparameter screen
	- `caid_hyperparam_screen_test_metrics.csv`: Test set statistics for full hyperparameter screen
	- `caid_hyperparam_screen_train_losses.csv`: Train losses for full hyperparameter screen
	- 📊 `fusion_disorder_boxplots.png`: Fig. 4E, left (data directly used to produce the plot at `fusion_disorder_boxplots_source_data.csv`)
	- 📊 `fusion_pred_disorder_r2.png`: Fig. 4E, right (data directly used to produce the plot at `fusion_pred_disorder_r2_source_data.csv`)
	- 📊 `CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png`: Fig. 4D (probabilities used at `CAID_prediction_source_data.csv`, FPR/TPR relationships directly used to make the plot at `CAID_fpr_tpr_source_data.csv`)
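
For readers who want to see how FPR/TPR pairs and an AUROC follow from per-residue probabilities, here is a dependency-free sketch. The helper names are hypothetical, and `plot.py` may well rely on a library such as scikit-learn instead.

```python
def roc_points(labels, probs):
    """Return (fpr, tpr) lists obtained by sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(probs, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume every residue tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auroc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )
```

With labels `[0, 0, 1, 1]` and probabilities `[0.1, 0.4, 0.35, 0.8]`, `auroc(*roc_points(labels, probs))` evaluates to 0.75.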

	To run the training script, use

	```
	nohup python train.py > train.out 2> train.err &
	```

	### Plotting

	The `plot.py` script generates many figures from the paper, alongside the formatted data directly used for plotting.

	```
	benchmarking/
	└── caid/
	    └── results/final/
	        ├── CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png
	    └── processed_data/
	        └── figures/
	            └── fusion_disorder/
	                ├── plddt_sequence_EML4-ALK.png
	                ├── plddt_sequence_EML4::ALK_source_data.csv
	                ├── plddt_sequence_EWSR1-FLI1.png
	                ├── plddt_sequence_EWSR1::FLI1_source_data.csv
	                ├── plddt_sequence_PAX3-FOXO1.png
	                ├── plddt_sequence_PAX3::FOXO1_source_data.csv
	                ├── plddt_sequence_SS18-SSX1.png
	                ├── plddt_sequence_SS18::SSX1_source_data.csv
	            └── histograms/
	                ├── disorder_nox_histogram.png
	                ├── disorder_nox_histogram_source_data.csv
	                ├── fusions_histogram.png
	                ├── fusions_histogram_source_data.csv
	                ├── heads_histogram.png
	                ├── heads_histogram_source_data.csv
	                ├── tails_histogram.png
	                ├── tails_histogram_source_data.csv
	```

	- Plots in `fusion_disorder` are from Fig. 1C
	- Plots in `histograms` are from Fig. 1D and Fig. S1

	To regenerate these plots and source data, run:

	```
	python plot.py
	```

	### Colored structure images
	`color_disorder_residues.ipynb` is used to plot fusion structures with pLDDT or disorder prediction color overlays. By running certain (or all) of its cells, you will recreate images from Fig. 1C and 4F, as well as the following file:

	```
	benchmarking/
	└── caid/
	    └── disorder_coloring_data
	        ├── normalized_disorder_propensities_source_data.csv
	```
	- `normalized_disorder_propensities_source_data.csv`: the normalized disorder propensities that were visualized on fusion structures in Fig. 4F