## Puncta Prediction Benchmark
This folder contains all the data and code needed to train FusOn-pLM-Puncta models and perform the **puncta prediction benchmark** (Figure 3).
### From raw data to train/test splits
To train the puncta predictors, we processed raw data from FOdb [(Tripathi et al. 2023)](https://doi.org/10.1038/s41467-023-41655-2): Supplementary dataset 4 (`fuson_plm/data/raw_data/FOdb_puncta.csv`) and Supplementary dataset 5 (`fuson_plm/data/raw_data/FOdb_SD5.csv`). Both are processed by `clean.py` in the `puncta` directory.
```
data/
└── raw_data/
    ├── FOdb_puncta.csv
    └── FOdb_SD5.csv
benchmarking/
└── puncta/
    ├── clean.py
    ├── cleaned_dataset_s4.csv
    ├── splits.csv
    └── FOdb_physicochemical_embeddings.pkl
```
The `clean.py` script generates the following files:
- **`cleaned_dataset_s4.csv`**: cleaned version of `FOdb_puncta.csv`, where fusion oncoproteins with puncta status "Other" or "Nucleolar" have been removed, and only the 25 low-MI features from `FOdb_SD5.csv` are retained.
- **`splits.csv`**: fusion oncoproteins from `cleaned_dataset_s4.csv`, labeled in the `split` column as either being part of the *train* set ("Expressed_Set" in FOdb) or *test* set ("Verification_Set" in FOdb). This dataset also features `nucleus`, `cytoplasm`, and `formation` columns of 1s and 0s. In `nucleus`, 1=forms a condensate in the nucleus, 0=does not; in `cytoplasm`, 1=forms a condensate in the cytoplasm, 0=does not; in `formation`, 1=forms a condensate at all, 0=does not.
- **`FOdb_physicochemical_embeddings.pkl`**: a dictionary where fusion proteins from `splits.csv` are the keys, and their feature vectors of 25 low-MI features from `cleaned_dataset_s4.csv` are the values.
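As a rough illustration of how these files fit together, the sketch below pairs rows of `splits.csv` with vectors from the embeddings dictionary to build per-task train and test sets. It uses small in-memory stand-ins rather than the real files. The identifier column name `fusion_id` and the literal split values `train` and `test` are assumptions made for the example; check the actual CSV header before reusing it.

```python
import csv
import io
import pickle

# Toy stand-ins for splits.csv and FOdb_physicochemical_embeddings.pkl.
# ASSUMPTION: the identifier column is called "fusion_id" and the split
# column holds the strings "train" / "test". Neither is confirmed here.
SPLITS_CSV = """fusion_id,split,nucleus,cytoplasm,formation
FO_A,train,1,0,1
FO_B,train,0,0,0
FO_C,test,0,1,1
"""
EMBEDDINGS_PKL = pickle.dumps({
    "FO_A": [0.1] * 25,  # 25 low-MI features per fusion oncoprotein
    "FO_B": [0.2] * 25,
    "FO_C": [0.3] * 25,
})


def build_split(rows, embeddings, split, task):
    """Return (X, y) for one split and one task column
    (nucleus, cytoplasm, or formation)."""
    X, y = [], []
    for row in rows:
        if row["split"] != split:
            continue
        X.append(embeddings[row["fusion_id"]])
        y.append(int(row[task]))
    return X, y


rows = list(csv.DictReader(io.StringIO(SPLITS_CSV)))
embeddings = pickle.loads(EMBEDDINGS_PKL)

X_train, y_train = build_split(rows, embeddings, "train", "formation")
X_test, y_test = build_split(rows, embeddings, "test", "formation")
print(len(X_train), len(X_train[0]), y_train, y_test)
```

With the real files, the two toy constants would be replaced by reading `splits.csv` and unpickling `FOdb_physicochemical_embeddings.pkl` from disk.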
### Training
`config.py` holds training configurations.
```
# Benchmarking configs
BENCHMARK_FUSONPLM = True # True if you want to benchmark a FusOn-pLM Model
# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
# If you want to use the trained FusOn-pLM, instead set FUSONPLM_CKPTS="FusOn-pLM"
FUSONPLM_CKPTS= {}
# Model comparison configs
BENCHMARK_ESM = True # True if you want to benchmark ESM-2-650M
BENCHMARK_PROTT5 = True # True if you want to benchmark ProtT5
BENCHMARK_FO_PUNCTA_ML = True # True if you want to benchmark FO-Puncta-ML from the FOdb paper
# Overwriting configs
PERMISSION_TO_OVERWRITE = False # if False, script will halt if it believes these embeddings have already been made.
# GPU configs
CUDA_VISIBLE_DEVICES="0" # GPUs to make visible for this process
```
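The two accepted shapes of `FUSONPLM_CKPTS` can be made concrete with a small helper. This is a sketch only, not code from `train.py`: the function name `expand_checkpoints`, the run name `my_run`, and the reading of "values = epochs" as a list of epoch numbers are all assumptions.

```python
def expand_checkpoints(fusonplm_ckpts):
    """Flatten a FUSONPLM_CKPTS value into (run_name, epoch) pairs.

    ASSUMPTION: each dictionary value is a list of epoch numbers.
    """
    if fusonplm_ckpts == "FusOn-pLM":
        # The released model: there is no run/epoch to enumerate.
        return [("FusOn-pLM", None)]
    if not isinstance(fusonplm_ckpts, dict):
        raise TypeError('FUSONPLM_CKPTS must be a dict or the string "FusOn-pLM"')
    pairs = []
    for run_name, epochs in fusonplm_ckpts.items():
        for epoch in epochs:
            pairs.append((run_name, epoch))
    return pairs


# Hypothetical run name and epochs, for illustration only.
custom_ckpts = {"my_run": [10, 20]}
print(expand_checkpoints(custom_ckpts))
print(expand_checkpoints("FusOn-pLM"))
```

Note that the default `FUSONPLM_CKPTS = {}` expands to an empty list under this reading, so it would benchmark no FusOn-pLM checkpoint even with `BENCHMARK_FUSONPLM = True`.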
<br>
`train.py` will train the XGBoost classifiers.
- All **results** are stored in `puncta/results/timestamp`, where `timestamp` is a unique string encoding the date and time when you started training.
- All **embeddings** made for training will be stored in a new folder called `puncta/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
```
benchmarking/
└── puncta/
    ├── embeddings/
    │   ├── esm2_t33_650M_UR50D/...
    │   ├── fuson_plm/...
    │   └── prot_t5_xl_half_uniref50_enc/...
    └── results/
        └── final/
            ├── figures/
            │   ├── cytoplasm_verificationFOs_barchart_source_data.csv
            │   ├── cytoplasm_verificationFOs_barchart.png
            │   ├── formation_verificationFOs_0.83thresh_barchart_source_data.csv
            │   ├── formation_verificationFOs_0.83thresh_barchart.png
            │   ├── nucleus_verificationFOs_barchart_source_data.csv
            │   └── nucleus_verificationFOs_barchart.png
            ├── cytoplasm_verificationFOs_results.csv
            ├── formation_verificationFOs_0.83thresh_results.csv
            └── nucleus_verificationFOs_results.csv
```
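The layout above, a timestamped results folder plus a per-model embeddings cache, can be sketched as follows. This is not the logic in `train.py`. The timestamp format, both function names, and the choice to reuse a non-empty cache when `PERMISSION_TO_OVERWRITE` is `False` are assumptions for illustration; the real script may halt instead.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_results_dir(root, now=None):
    """Create and return <root>/results/<timestamp>."""
    now = now or datetime.now()
    # ASSUMPTION: the real timestamp format may differ.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    out = Path(root) / "results" / stamp
    out.mkdir(parents=True, exist_ok=False)
    return out


def should_embed(root, model_name, permission_to_overwrite):
    """Return True if embeddings for model_name should be (re)generated."""
    cache = Path(root) / "embeddings" / model_name
    already_made = cache.is_dir() and any(cache.iterdir())
    if already_made and not permission_to_overwrite:
        return False  # keep the cached embeddings untouched
    cache.mkdir(parents=True, exist_ok=True)
    return True


with tempfile.TemporaryDirectory() as root:
    results = make_results_dir(root, now=datetime(2024, 1, 2, 3, 4, 5))
    model = "esm2_t33_650M_UR50D"
    first = should_embed(root, model, permission_to_overwrite=False)
    # Simulate a finished embedding run by writing one file into the cache.
    (Path(root) / "embeddings" / model / "embeddings.pkl").write_bytes(b"")
    second = should_embed(root, model, permission_to_overwrite=False)
    print(results.name, first, second)
```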
The following files are in `results/final/figures`:
- **`cytoplasm_verificationFOs_barchart.png`**: bar chart of performance on the cytoplasm puncta prediction task (Fig. 3E), and the formatted data that went directly into the plot (`cytoplasm_verificationFOs_barchart_source_data.csv`)
- **`formation_verificationFOs_0.83thresh_barchart.png`**: bar chart of performance on the puncta formation prediction task (Fig. 3C), and the formatted data that went directly into the plot (`formation_verificationFOs_0.83thresh_barchart_source_data.csv`)
- **`nucleus_verificationFOs_barchart.png`**: bar chart of performance on the nucleus puncta prediction task (Fig. 3D), and the formatted data that went directly into the plot (`nucleus_verificationFOs_barchart_source_data.csv`)
The raw data are included in `results/final` as `cytoplasm_verificationFOs_results.csv`, `formation_verificationFOs_0.83thresh_results.csv`, and `nucleus_verificationFOs_results.csv`.
If you train a new model, the equivalents of these files will be created in `results/timestamp` for your specific configurations set in `config.py`.
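The `0.83thresh` in the formation filenames suggests that predicted probabilities for the formation task are binarized at 0.83. That reading is an assumption, as are the `>=` comparison and the particular metrics computed below. A minimal, dependency-free sketch of such a thresholded evaluation:

```python
def binarize(probabilities, threshold=0.83):
    """Label a prediction positive when its probability reaches the threshold.

    ASSUMPTION: the comparison is inclusive (>=).
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy probabilities and labels, not real benchmark data.
probs = [0.95, 0.83, 0.60, 0.10]
y_true = [1, 1, 1, 0]
y_pred = binarize(probs)
metrics = classification_metrics(y_true, y_pred)
print(y_pred, metrics)
```

A high threshold like this trades recall for precision: the third toy example is a true positive that is missed because its probability falls below 0.83.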
To run training, enter in terminal:
```
python train.py
```
To regenerate plots, run:
```
python plot.py
```