
Puncta Prediction Benchmark

This folder contains all the data and code needed to train FusOn-pLM-Puncta models and perform the puncta prediction benchmark (Figure 3).

From raw data to train/test splits

To train the puncta predictors, we processed raw data from FOdb (Tripathi et al. 2023), specifically Supplementary dataset 4 (fuson_plm/data/raw_data/FOdb_puncta.csv) and Supplementary dataset 5 (fuson_plm/data/raw_data/FOdb_SD5.csv), using clean.py in the puncta directory.

data/
└── raw_data/
    ├── FOdb_puncta.csv
    └── FOdb_SD5.csv

benchmarking/
└── puncta/
    ├── clean.py
    ├── cleaned_dataset_s4.csv
    ├── splits.csv
    └── FOdb_physicochemical_embeddings.pkl

The clean.py script generates the following files:

cleaned_dataset_s4.csv: cleaned version of FOdb_puncta.csv, where fusion oncoproteins with puncta status "Other" or "Nucleolar" have been removed, and only the 25 low-MI features from FOdb_SD5.csv are retained.
splits.csv: fusion oncoproteins from cleaned_dataset_s4.csv, labeled in the split column as belonging to either the train set ("Expressed_Set" in FOdb) or the test set ("Verification_Set" in FOdb). This dataset also has binary nucleus, cytoplasm, and formation columns. In nucleus, 1=forms a condensate in the nucleus, 0=does not; in cytoplasm, 1=forms a condensate in the cytoplasm, 0=does not; in formation, 1=forms a condensate at all, 0=does not.
FOdb_physicochemical_embeddings.pkl: a dictionary where fusion proteins from splits.csv are the keys, and their feature vectors of 25 low-MI features from cleaned_dataset_s4.csv are the values.
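
As a rough illustration of how these two files fit together, the sketch below pairs each row of splits.csv with its feature vector from the pickled dictionary. The split, nucleus, cytoplasm, and formation column names come from the description above; the identifier column name ("fusion") and the literal values stored in the split column ("train"/"test") are assumptions and may differ in the actual files.

```python
import csv
import pickle


def load_split(splits_path, embeddings_path, split_value, task, id_col="fusion"):
    """Return (features, labels) for one split and one task.

    Assumptions (not confirmed by this README): the identifier column
    name `id_col` and the literal values stored in the `split` column.
    """
    with open(embeddings_path, "rb") as fh:
        embeddings = pickle.load(fh)  # dict: fusion protein -> feature vector

    features, labels = [], []
    with open(splits_path, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["split"] != split_value:
                continue
            features.append(embeddings[row[id_col]])
            labels.append(int(row[task]))  # task: nucleus, cytoplasm, or formation
    return features, labels
```

For example, load_split("splits.csv", "FOdb_physicochemical_embeddings.pkl", "train", "formation") would return the training features and the formation labels, provided the assumed column name and split values match the real files.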

Training

config.py holds the training configurations.

# Benchmarking configs
BENCHMARK_FUSONPLM = True                           # True if you want to benchmark a FusOn-pLM Model

# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
# If you want to use the trained FusOn-pLM, instead set FUSONPLM_CKPTS="FusOn-pLM"
FUSONPLM_CKPTS = {}

# Model comparison configs
BENCHMARK_ESM = True                                # True if you want to benchmark ESM-2-650M
BENCHMARK_PROTT5 = True                             # True if you want to benchmark ProtT5
BENCHMARK_FO_PUNCTA_ML = True                       # True if you want to benchmark FO-Puncta-ML from the FOdb paper

# Overwriting configs
PERMISSION_TO_OVERWRITE = False                     # if False, script will halt if it believes these embeddings have already been made. 

# GPU configs
CUDA_VISIBLE_DEVICES="0"                            # GPUs to make visible for this process
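
The two forms of FUSONPLM_CKPTS described in the comments above can be illustrated with a small sketch. The run name "my_run", the epoch numbers, and the helper list_checkpoints are hypothetical; storing the epochs as a list is also an assumption. Only the os.environ call is standard Python behavior.

```python
import os

# Hypothetical run name and epoch numbers; substitute your own.
FUSONPLM_CKPTS = {"my_run": [10, 20]}
CUDA_VISIBLE_DEVICES = "0"


def list_checkpoints(ckpts):
    """Expand the config value into (run_name, epoch) pairs.

    Illustrative helper, not part of the repository.
    """
    if ckpts == "FusOn-pLM":
        # The released model: no run-specific epoch to select.
        return [("FusOn-pLM", None)]
    return [(run, epoch) for run, epochs in ckpts.items() for epoch in epochs]


# Restrict which GPUs the process can see. This must happen before
# any CUDA-using library initializes the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = CUDA_VISIBLE_DEVICES
```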

train.py will train the XGBoost classifiers.

All results are stored in puncta/results/timestamp, where timestamp is a unique string encoding the date and time when you started training.
All embeddings made for training will be stored in a new folder called puncta/embeddings/ with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
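
The directory layout described above can be sketched as follows. This is not the code in train.py: the timestamp format and both helper names are assumptions, shown only to make the results/timestamp and embeddings/model conventions concrete.

```python
from datetime import datetime
from pathlib import Path


def make_results_dir(root="results", now=None):
    """Create and return results/<timestamp>/ for a new training run."""
    now = now or datetime.now()
    # The timestamp format is an assumption; any unique date-time string works.
    out = Path(root) / now.strftime("%Y-%m-%d_%H-%M-%S")
    out.mkdir(parents=True, exist_ok=False)
    return out


def find_cached_embeddings(model_name, root="embeddings"):
    """Return the model's embedding subfolder if it exists and is non-empty, else None."""
    folder = Path(root) / model_name
    if folder.is_dir() and any(folder.iterdir()):
        return folder
    return None
```

A training script following this pattern would call find_cached_embeddings("esm2_t33_650M_UR50D") first and only compute embeddings when it returns None.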

benchmarking/
└── puncta/
    ├── embeddings/
    │   ├── esm2_t33_650M_UR50D/...
    │   ├── fuson_plm/...
    │   └── prot_t5_xl_half_uniref50_enc/...
    └── results/
        └── final/
            ├── figures/
            │   ├── cytoplasm_verificationFOs_barchart_source_data.csv
            │   ├── cytoplasm_verificationFOs_barchart.png
            │   ├── formation_verificationFOs_0.83thresh_barchart_source_data.csv
            │   ├── formation_verificationFOs_0.83thresh_barchart.png
            │   ├── nucleus_verificationFOs_barchart_source_data.csv
            │   └── nucleus_verificationFOs_barchart.png
            ├── cytoplasm_verificationFOs_results.csv
            ├── formation_verificationFOs_0.83thresh_results.csv
            └── nucleus_verificationFOs_results.csv

The following files are in results/final/figures:

cytoplasm_verificationFOs_barchart.png: bar chart of performance on the cytoplasm puncta prediction task (Fig. 3E), and the formatted data that went directly into the plot (cytoplasm_verificationFOs_barchart_source_data.csv)
formation_verificationFOs_0.83thresh_barchart.png: bar chart of performance on the puncta formation prediction task (Fig. 3C), and the formatted data that went directly into the plot (formation_verificationFOs_0.83thresh_barchart_source_data.csv)
nucleus_verificationFOs_barchart.png: bar chart of performance on the nucleus puncta prediction task (Fig. 3D), and the formatted data that went directly into the plot (nucleus_verificationFOs_barchart_source_data.csv)

The raw data are included in results/final as cytoplasm_verificationFOs_results.csv, formation_verificationFOs_0.83thresh_results.csv, and nucleus_verificationFOs_results.csv.
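
The "0.83thresh" in the formation file names suggests that formation predictions were binarized at a probability cutoff of 0.83. That reading is an assumption based on the file names alone, as is the choice of >= rather than > at the boundary. Under those assumptions, a minimal sketch of applying such a cutoff and scoring the result looks like this:

```python
def binarize(probabilities, threshold=0.83):
    """Convert predicted probabilities to 0/1 labels (>= threshold -> 1)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A stricter cutoff like 0.83 trades recall for precision compared with the default 0.5, which is why the threshold is worth recording in the output file names.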

If you train a new model, the equivalents of these files will be created in results/timestamp for your specific configurations set in config.py.

To run training, enter in the terminal:

python train.py

To regenerate plots, run:

python plot.py