Puncta Prediction Benchmark
This folder contains all the data and code needed to train FusOn-pLM-Puncta models and perform the puncta prediction benchmark (Figure 3).
From raw data to train/test splits
To train the puncta predictors, we processed raw data from FOdb (Tripathi et al. 2023) Supplementary dataset 4 (fuson_plm/data/raw_data/FOdb_puncta.csv) and Supplementary dataset 5 (fuson_plm/data/raw_data/FOdb_SD5.csv) using clean.py in the puncta directory.
data/
└── raw_data/
    ├── FOdb_puncta.csv
    └── FOdb_SD5.csv
benchmarking/
└── puncta/
    ├── clean.py
    ├── cleaned_dataset_s4.csv
    ├── splits.csv
    └── FOdb_physicochemical_embeddings.pkl
The clean.py script generates the following files:
- cleaned_dataset_s4.csv: cleaned version of FOdb_puncta.csv, where fusion oncoproteins with puncta status "Other" or "Nucleolar" have been removed, and only the 25 low-MI features from FOdb_SD5.csv are retained.
- splits.csv: fusion oncoproteins from cleaned_dataset_s4.csv, labeled in the split column as belonging to either the train set ("Expressed_Set" in FOdb) or the test set ("Verification_Set" in FOdb). This dataset also has nucleus, cytoplasm, and formation columns of 1s and 0s. In nucleus, 1 = forms a condensate in the nucleus, 0 = does not; in cytoplasm, 1 = forms a condensate in the cytoplasm, 0 = does not; in formation, 1 = forms a condensate at all, 0 = does not.
- FOdb_physicochemical_embeddings.pkl: a dictionary where fusion proteins from splits.csv are the keys, and their feature vectors of 25 low-MI features from cleaned_dataset_s4.csv are the values.
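As a rough illustration of how these files might be consumed downstream, the sketch below reads splits.csv and the pickled dictionary and assembles features and labels for one task and one split. The split, nucleus, cytoplasm, and formation columns come from the description above; the key column name "sequence", the split labels "train" and "test", and the helpers load_inputs and build_xy are assumptions made for illustration, so check the headers that clean.py actually writes.

```python
import csv
import pickle


def load_inputs(splits_path, embeddings_path):
    """Read splits.csv into a list of row dicts and unpickle the embedding dictionary."""
    with open(splits_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(embeddings_path, "rb") as f:
        embeddings = pickle.load(f)
    return rows, embeddings


def build_xy(rows, embeddings, task, split_name, key_col="sequence"):
    """Collect feature vectors and 0/1 labels for one task and one split.

    rows: iterable of dicts, e.g. from csv.DictReader over splits.csv
    embeddings: dict mapping fusion protein -> feature vector
    task: "nucleus", "cytoplasm", or "formation"
    key_col: column holding the dictionary key (assumed name)
    """
    X, y = [], []
    for row in rows:
        if row["split"] != split_name:
            continue
        X.append(list(embeddings[row[key_col]]))
        # Labels may be stored as "1" or "1.0"; normalize to int
        y.append(int(float(row[task])))
    return X, y
```

With the real files, usage would look like rows, emb = load_inputs("splits.csv", "FOdb_physicochemical_embeddings.pkl") followed by build_xy(rows, emb, "formation", "train"), adjusting the assumed names as needed.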
Training
config.py holds the training configurations:
# Benchmarking configs
BENCHMARK_FUSONPLM = True # True if you want to benchmark a FusOn-pLM Model
# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
# If you want to use the trained FusOn-pLM instead, set FUSONPLM_CKPTS = "FusOn-pLM"
FUSONPLM_CKPTS= {}
# Model comparison configs
BENCHMARK_ESM = True # True if you want to benchmark ESM-2-650M
BENCHMARK_PROTT5 = True # True if you want to benchmark ProtT5
BENCHMARK_FO_PUNCTA_ML = True # True if you want to benchmark FO-Puncta-ML from the FOdb paper
# Overwriting configs
PERMISSION_TO_OVERWRITE = False # if False, script will halt if it believes these embeddings have already been made.
# GPU configs
CUDA_VISIBLE_DEVICES="0" # GPUs to make visible for this process
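For illustration, the two forms of FUSONPLM_CKPTS described in the comments might look like the following. The run name "my_run", the epoch numbers, and the helper list_checkpoints are hypothetical; whether each dictionary value is a single epoch or a list of epochs should be checked against train.py, so the helper accepts both.

```python
# Option 1: benchmark the released model
FUSONPLM_CKPTS = "FusOn-pLM"

# Option 2 (hypothetical run name and epochs): benchmark your own checkpoints
FUSONPLM_CKPTS = {"my_run": [10, 20]}


def list_checkpoints(ckpts):
    """Flatten either form into (run_name, epoch) pairs.

    For the released model the epoch is None.
    """
    if isinstance(ckpts, str):
        return [(ckpts, None)]
    pairs = []
    for run_name, epochs in ckpts.items():
        # Accept a single epoch or a list of epochs
        if isinstance(epochs, int):
            epochs = [epochs]
        pairs.extend((run_name, epoch) for epoch in epochs)
    return pairs
```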
train.py trains the XGBoost classifiers.
- All results are stored in puncta/results/timestamp, where timestamp is a unique string encoding the date and time when you started training.
- All embeddings made for training are stored in a new folder called puncta/embeddings/, with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
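A minimal sketch of the directory logic described above, under stated assumptions: the timestamp format, the cache file name embeddings.pkl, and the helpers make_results_dir, embedding_path, and needs_embedding are all hypothetical and are not taken from train.py.

```python
import os
from datetime import datetime


def make_results_dir(root="results", now=None):
    """Create and return <root>/<timestamp>; the timestamp format is an assumption."""
    now = now or datetime.now()
    path = os.path.join(root, now.strftime("%Y-%m-%d_%H-%M-%S"))
    os.makedirs(path, exist_ok=True)
    return path


def embedding_path(model_name, root="embeddings", filename="embeddings.pkl"):
    """Return the per-model cache location, e.g. embeddings/fuson_plm/embeddings.pkl."""
    return os.path.join(root, model_name, filename)


def needs_embedding(model_name, root="embeddings"):
    """True when no cached embeddings exist yet for this model."""
    return not os.path.exists(embedding_path(model_name, root))
```

Keeping one subfolder per model means a second run can check needs_embedding("fuson_plm") and skip the embedding step when the cache is already present.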
benchmarking/
└── puncta/
    ├── embeddings/
    │   ├── esm2_t33_650M_UR50D/...
    │   ├── fuson_plm/...
    │   └── prot_t5_xl_half_uniref50_enc/...
    └── results/
        └── final/
            ├── figures/
            │   ├── cytoplasm_verificationFOs_barchart_source_data.csv
            │   ├── cytoplasm_verificationFOs_barchart.png
            │   ├── formation_verificationFOs_0.83thresh_barchart_source_data.csv
            │   ├── formation_verificationFOs_0.83thresh_barchart.png
            │   ├── nucleus_verificationFOs_barchart_source_data.csv
            │   └── nucleus_verificationFOs_barchart.png
            ├── cytoplasm_verificationFOs_results.csv
            ├── formation_verificationFOs_0.83thresh_results.csv
            └── nucleus_verificationFOs_results.csv
The following files are in results/final/figures:
- cytoplasm_verificationFOs_barchart.png: bar chart of performance on the cytoplasm puncta prediction task (Fig. 3E), and the formatted data that went directly into the plot (cytoplasm_verificationFOs_barchart_source_data.csv)
- formation_verificationFOs_0.83thresh_barchart.png: bar chart of performance on the puncta formation prediction task (Fig. 3C), and the formatted data that went directly into the plot (formation_verificationFOs_0.83thresh_barchart_source_data.csv)
- nucleus_verificationFOs_barchart.png: bar chart of performance on the nucleus puncta prediction task (Fig. 3D), and the formatted data that went directly into the plot (nucleus_verificationFOs_barchart_source_data.csv)
The raw data are included in results/final as cytoplasm_verificationFOs_results.csv, formation_verificationFOs_0.83thresh_results.csv, and nucleus_verificationFOs_results.csv.
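The 0.83thresh in the formation file names suggests that predicted probabilities for the formation task were binarized at 0.83 rather than 0.5; this reading is an assumption based on the file names alone. Under that assumption, scoring a set of predictions could be sketched as follows. The function name, the use of >= at the boundary, and the choice of metrics are illustrative and not taken from train.py.

```python
def score_at_threshold(y_true, y_prob, threshold=0.83):
    """Binarize probabilities at `threshold` and compute basic classification metrics."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Raising the threshold above 0.5 trades recall for precision, which is why the same probabilities can give different bar heights at different cutoffs.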
If you train a new model, the equivalents of these files will be created in results/timestamp for the configurations you set in config.py.
To run training, enter in terminal:
python train.py
To regenerate plots, run:
python plot.py