## Puncta Prediction Benchmark

This folder contains all the data and code needed to train FusOn-pLM-Puncta models and perform the **puncta prediction benchmark** (Figure 3). 

### From raw data to train/test splits
To train the puncta predictors, we processed raw data from FOdb [(Tripathi et al. 2023)](https://doi.org/10.1038/s41467-023-41655-2), specifically Supplementary Dataset 4 (`fuson_plm/data/raw_data/FOdb_puncta.csv`) and Supplementary Dataset 5 (`fuson_plm/data/raw_data/FOdb_SD5.csv`), using the script `clean.py` in the `puncta` directory.

```
data/                                 
└── raw_data/ 
    ├── FOdb_puncta.csv
    ├── FOdb_SD5.csv
    
benchmarking/
└── puncta/ 
    ├── clean.py
    ├── cleaned_dataset_s4.csv
    ├── splits.csv
    ├── FOdb_physicochemical_embeddings.pkl
```

The `clean.py` script generates the following files: 
- **`cleaned_dataset_s4.csv`**: cleaned version of `FOdb_puncta.csv`, where fusion oncoproteins with puncta status "Other" or "Nucleolar" have been removed, and only the 25 low-MI features from `FOdb_SD5.csv` are retained.
- **`splits.csv`**: fusion oncoproteins from `cleaned_dataset_s4.csv`, labeled in the `split` column as either being part of the *train* set ("Expressed_Set" in FOdb) or *test* set ("Verification_Set" in FOdb). This dataset also features `nucleus`, `cytoplasm`, and `formation` columns of 1s and 0s. In `nucleus`, 1=forms a condensate in the nucleus, 0=does not; in `cytoplasm`, 1=forms a condensate in the cytoplasm, 0=does not; in `formation`, 1=forms a condensate at all, 0=does not.
- **`FOdb_physicochemical_embeddings.pkl`**: a dictionary where fusion proteins from `splits.csv` are the keys, and their feature vectors of 25 low-MI features from `cleaned_dataset_s4.csv` are the values.
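
As a rough illustration of how these files fit together, the sketch below pairs a toy stand-in for `splits.csv` with a toy stand-in for the pickled feature dictionary and assembles features and labels for one split and one task. The `split`, `nucleus`, `cytoplasm`, and `formation` columns come from the description above; the ID column name `fusion_name`, the split values `train`/`test`, and the helper `make_xy` are assumptions made for this example and may not match what `clean.py` or `train.py` actually use.

```python
import csv
import io
import pickle

# Toy stand-in for splits.csv. `fusion_name` is a hypothetical ID column,
# and the "train"/"test" values in `split` are assumed.
TOY_SPLITS_CSV = """fusion_name,split,nucleus,cytoplasm,formation
FO_A,train,1,0,1
FO_B,train,0,0,0
FO_C,test,0,1,1
FO_D,test,1,1,1
"""

# Toy stand-in for FOdb_physicochemical_embeddings.pkl:
# 3 features per protein here instead of the 25 low-MI features.
toy_pickle_bytes = pickle.dumps({
    "FO_A": [0.1, 0.2, 0.3],
    "FO_B": [0.4, 0.5, 0.6],
    "FO_C": [0.7, 0.8, 0.9],
    "FO_D": [1.0, 1.1, 1.2],
})


def make_xy(rows, embeddings, split_name, task, id_col="fusion_name"):
    """Return (features, labels) for one split and one binary task."""
    subset = [row for row in rows if row["split"] == split_name]
    X = [embeddings[row[id_col]] for row in subset]
    y = [int(row[task]) for row in subset]
    return X, y


rows = list(csv.DictReader(io.StringIO(TOY_SPLITS_CSV)))
embeddings = pickle.loads(toy_pickle_bytes)

X_train, y_train = make_xy(rows, embeddings, "train", "formation")
X_test, y_test = make_xy(rows, embeddings, "test", "nucleus")
print(len(X_train), y_train)
print(len(X_test), y_test)
```

With the real files, the two toy objects would be replaced by reading `splits.csv` and unpickling `FOdb_physicochemical_embeddings.pkl` from the `puncta` directory.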

### Training

`config.py` holds the training configurations.

```
# Benchmarking configs
BENCHMARK_FUSONPLM = True                           # True if you want to benchmark a FusOn-pLM Model

# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
# If you want to use the trained FusOn-pLM, instead set FUSONPLM_CKPTS="FusOn-pLM"
FUSONPLM_CKPTS= {}

# Model comparison configs
BENCHMARK_ESM = True                                # True if you want to benchmark ESM-2-650M
BENCHMARK_PROTT5 = True                             # True if you want to benchmark ProtT5
BENCHMARK_FO_PUNCTA_ML = True                       # True if you want to benchmark FO-Puncta-ML from the FOdb paper

# Overwriting configs
PERMISSION_TO_OVERWRITE = False                     # if False, script will halt if it believes these embeddings have already been made. 

# GPU configs
CUDA_VISIBLE_DEVICES="0"                            # GPUs to make visible for this process
```
<br>
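
To make the two forms of `FUSONPLM_CKPTS` concrete, here is a minimal sketch of how such a setting could be expanded into individual checkpoints. The helper `expand_ckpts`, the run name `my_run`, the epoch numbers, and the assumption that each dictionary value is a list of epochs are all hypothetical; how `train.py` actually interprets the setting is not documented here.

```python
def expand_ckpts(ckpts):
    """Hypothetical helper: list the (run name, epoch) pairs a setting refers to."""
    if ckpts == "FusOn-pLM":
        # The released FusOn-pLM model; no epoch applies.
        return [("FusOn-pLM", None)]
    if isinstance(ckpts, dict):
        # Assumed layout: {run name: [epoch, epoch, ...]}
        return [(run, epoch) for run, epochs in ckpts.items() for epoch in epochs]
    raise ValueError("FUSONPLM_CKPTS must be a dict or the string 'FusOn-pLM'")


# A model you trained yourself (hypothetical run name and epochs)
print(expand_ckpts({"my_run": [10, 20]}))

# The trained FusOn-pLM
print(expand_ckpts("FusOn-pLM"))

# The default empty dictionary selects nothing
print(expand_ckpts({}))
```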

`train.py` will train the XGBoost classifiers. 
- All **results** are stored in `puncta/results/timestamp`, where `timestamp` is a unique string encoding the date and time when you started training. 
- All **embeddings** made for training will be stored in a new folder called `puncta/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.

```
benchmarking/
└── puncta/ 
    ├── embeddings/ 
    │   ├── esm2_t33_650M_UR50D/...
    │   ├── fuson_plm/...
    │   └── prot_t5_xl_half_uniref50_enc/...
    └── results/ 
        └── final/ 
            ├── figures/
            │   ├── cytoplasm_verificationFOs_barchart_source_data.csv
            │   ├── cytoplasm_verificationFOs_barchart.png
            │   ├── formation_verificationFOs_0.83thresh_barchart_source_data.csv
            │   ├── formation_verificationFOs_0.83thresh_barchart.png
            │   ├── nucleus_verificationFOs_barchart_source_data.csv
            │   └── nucleus_verificationFOs_barchart.png
            ├── cytoplasm_verificationFOs_results.csv
            ├── formation_verificationFOs_0.83thresh_results.csv
            └── nucleus_verificationFOs_results.csv
```
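
The reuse of stored embeddings can be pictured as a load-or-compute cache keyed by model name. The sketch below is only an illustration under stated assumptions: the function `load_or_make_embeddings`, the file name `embeddings.pkl`, and the toy embedding function are hypothetical, and the real `train.py` may organize the model subfolders differently (for example, it also consults `PERMISSION_TO_OVERWRITE`).

```python
import os
import pickle
import tempfile


def load_or_make_embeddings(root, model_name, sequences, embed_fn):
    """Load cached embeddings for a model if present; otherwise compute and save them."""
    folder = os.path.join(root, "embeddings", model_name)
    path = os.path.join(folder, "embeddings.pkl")  # hypothetical file name
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    os.makedirs(folder, exist_ok=True)
    embeddings = {seq: embed_fn(seq) for seq in sequences}
    with open(path, "wb") as handle:
        pickle.dump(embeddings, handle)
    return embeddings


# Toy embedding function that records how often it is called
calls = []

def toy_embed(seq):
    calls.append(seq)
    return [float(len(seq))]


root = tempfile.mkdtemp()  # stands in for the puncta/ directory
first = load_or_make_embeddings(root, "toy_model", ["MKV", "MKVLA"], toy_embed)
second = load_or_make_embeddings(root, "toy_model", ["MKV", "MKVLA"], toy_embed)

# The second call reads from disk, so toy_embed ran only for the first call
print(first == second, len(calls))
```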

The following files are in `results/final/figures`:
- **`cytoplasm_verificationFOs_barchart.png`**: bar chart of performance on the cytoplasm puncta prediction task (Fig. 3E), and the formatted data that went directly into the plot (`cytoplasm_verificationFOs_barchart_source_data.csv`)
- **`formation_verificationFOs_0.83thresh_barchart.png`**: bar chart of performance on the puncta formation prediction task (Fig. 3C), and the formatted data that went directly into the plot (`formation_verificationFOs_0.83thresh_barchart_source_data.csv`)
- **`nucleus_verificationFOs_barchart.png`**: bar chart of performance on the nucleus puncta prediction task (Fig. 3D), and the formatted data that went directly into the plot (`nucleus_verificationFOs_barchart_source_data.csv`)

The raw results behind these figures are included in `results/final` as `cytoplasm_verificationFOs_results.csv`, `formation_verificationFOs_0.83thresh_results.csv`, and `nucleus_verificationFOs_results.csv`.

If you train a new model, the equivalents of these files will be created in `results/timestamp`, reflecting the configurations you set in `config.py`.

To run training, enter the following in a terminal:
```
python train.py
```

To regenerate the plots, run:
```
python plot.py
```