## IDR Property Prediction Benchmark
This folder contains all the data and code needed to perform the **IDR property prediction benchmark**, where FusOn-pLM-IDR (a regressor built on FusOn-pLM embeddings) is used to predict aggregate properties of intrinsically disordered regions (IDRs), specifically asphericity, end-to-end radius (Re), radius of gyration (Rg), and polymer scaling exponent (Figure 4A-B).
### TL;DR
The order in which to run the scripts, after downloading data:
```
python clean.py # clean the data
python cluster.py # MMSeqs2 clustering
python split.py # make cluster-based train/val/test splits
python train.py # train the model
python plot.py # if you want to remake r2 plots
```
### Downloading raw IDR data
IDR properties from [Lotthammer et al. 2024](https://doi.org/10.1038/s41592-023-02159-5) (ALBATROSS model) were used to train FusOn-pLM-IDR. Sequences were downloaded from [this link](https://github.com/holehouse-lab/supportingdata/blob/master/2023/ALBATROSS_2023/simulations/data/all_sequences.tgz) and deposited in `raw_data`. All files in `raw_data` are from this direct download.
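If you download the archive manually, one way to unpack it into `raw_data/` is sketched below. This is only a convenience snippet, assuming `all_sequences.tgz` has been saved into this folder; adjust the paths to wherever you placed the file, and note that the archive layout may include subfolders.
```
# Hedged sketch: unpack the manually downloaded ALBATROSS archive into raw_data/.
# Adjust the archive path to wherever you saved all_sequences.tgz.
import tarfile

with tarfile.open("all_sequences.tgz", "r:gz") as tar:
    tar.extractall(path="raw_data")
```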
```
benchmarking/
└── idr_prediction/
└── raw_data/
├── asph_bio_synth_training_data_cleaned_05_09_2023.tsv
├── asph_nat_meth_test.tsv
├── scaled_re_bio_synth_training_data_cleaned_05_09_2023.tsv
├── scaled_re_nat_meth_test.tsv
├── scaled_rg_bio_synth_training_data_cleaned_05_09_2023.tsv
├── scaled_rg_nat_meth_test.tsv
├── scaling_exp_bio_synth_training_data_cleaned_05_09_2023.tsv
├── scaling_exp_nat_meth_test.tsv
```
- **`asph`**=asphericity, **`scaled_re`**=scaled Re, **`scaled_rg`**=scaled Rg, **`scaling_exp`**=polymer scaling exponent
- **`_bio_synth_training_data_cleaned_05_09_2023.tsv`** files are the ALBATROSS **training data** for the four properties, downloaded directly from their GitHub
- **`_nat_meth_test.tsv`** files are the ALBATROSS **testing data** for the four properties, downloaded directly from their GitHub
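To sanity-check a raw file before cleaning, you can load it with pandas. This is a quick sketch; the exact column layout of the raw TSVs is not documented here, so the snippet only inspects the file.
```
# Quick look at one raw ALBATROSS TSV (columns are whatever the file provides).
import pandas as pd

raw = pd.read_csv(
    "raw_data/asph_bio_synth_training_data_cleaned_05_09_2023.tsv",
    sep="\t",
)
print(raw.shape)
print(raw.head())
```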
### Cleaning raw IDR data
`clean.py` cleans the raw training and testing data separately for each property. Any sequence that appears in both train and test is removed from train and kept in test. Finally, the four properties are combined into one file:
```
benchmarking/
└── idr_prediction/
└── processed_data/
├── all_albatross_seqs_and_properties.csv
```
- **`all_albatross_seqs_and_properties.csv`**: Columns = "Sequence","IDs","UniProt_IDs","UniProt_Names","Split","asph","scaled_re","scaled_rg","scaling_exp". The "Split" column is either "Train" or "Test", indicating how the sequence was used by the ALBATROSS model
To perform cleaning, run
```
python clean.py
```
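The overlap-removal step can be pictured as follows. This is a minimal sketch of the logic described above, not the actual `clean.py`, and it assumes the loaded DataFrames expose a `Sequence` column (a hypothetical name used here for illustration).
```
# Sketch of the train/test overlap removal described above (NOT the actual clean.py).
# Assumes each raw TSV has been loaded into a DataFrame with a "Sequence" column.
import pandas as pd

def drop_train_test_overlap(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Remove sequences from train that also appear in test; test keeps them."""
    overlap = set(train_df["Sequence"]) & set(test_df["Sequence"])
    return train_df[~train_df["Sequence"].isin(overlap)].reset_index(drop=True)
```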
### Using config.py for clustering, splitting, training
`config.py` holds the configuration parameters for clustering, splitting, and training:
```
# Clustering Parameters
CLUSTER = CustomParams(
    # MMSeqs2 parameters: see the MMSeqs2 GitHub or wiki for guidance
    MIN_SEQ_ID = 0.3,  # minimum sequence identity (fraction)
    C = 0.5,           # minimum sequence length overlap (fraction)
    COV_MODE = 1,      # cov-mode: 0 = bidirectional, 1 = target coverage, 2 = query coverage, 3 = target-in-query length coverage
    CLUSTER_MODE = 2,

    # File paths
    INPUT_PATH = 'processed_data/all_albatross_seqs_and_properties.csv',
    PATH_TO_MMSEQS = '../../mmseqs'  # path to where you installed MMSeqs2
)

# Split config
SPLIT = CustomParams(
    IDR_DB_PATH = 'processed_data/all_albatross_seqs_and_properties.csv',
    CLUSTER_OUTPUT_PATH = 'clustering/mmseqs_full_results.csv',
    RANDOM_STATE_1 = 2,   # random state for splitting all clusters into train & other
    TEST_SIZE_1 = 0.21,   # test size for the clusters -> train/other split, e.g. 0.21 means 79% of clusters in train, 21% in other
    RANDOM_STATE_2 = 6,   # random state for splitting "other" (from above) into val and test
    TEST_SIZE_2 = 0.50    # test size for the other -> val/test split, e.g. 0.50 means 50% of held-out clusters in val, 50% in test
)

# Which models to benchmark
TRAIN = CustomParams(
    BENCHMARK_FUSONPLM = True,
    FUSONPLM_CKPTS = "FusOn-pLM",  # dictionary (key = run name, values = epochs) or the string "FusOn-pLM"
    BENCHMARK_ESM = True,

    # GPU configs
    CUDA_VISIBLE_DEVICES = "0",

    # Overwriting configs
    PERMISSION_TO_OVERWRITE_EMBEDDINGS = False,  # if False, the script will halt if it believes these embeddings have already been made
    PERMISSION_TO_OVERWRITE_MODELS = False       # if False, the script will halt if it believes these models have already been trained
)
```
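`CustomParams` is a small helper defined in the repo; a minimal stand-in with the same call pattern (keyword arguments become attributes) might look like this:
```
# Minimal stand-in for the repo's CustomParams helper: keyword arguments become attributes.
class CustomParams:
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __repr__(self):
        return f"CustomParams({self.__dict__})"

# Example: CLUSTER.MIN_SEQ_ID -> 0.3
CLUSTER = CustomParams(MIN_SEQ_ID=0.3, C=0.5, COV_MODE=1, CLUSTER_MODE=2)
```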
### Clustering
Clustering of all sequences in `all_albatross_seqs_and_properties.csv` is performed by `cluster.py`.
The clustering command entered by the script is:
```
mmseqs easy-cluster clustering/input.fasta clustering/raw_output/mmseqs clustering/raw_output --min-seq-id 0.3 -c 0.5 --cov-mode 1 --cluster-mode 2 --dbtype 1
```
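One way such a command could be assembled from the `CLUSTER` config is sketched below; `cluster.py` may construct it differently, and the snippet assumes `config.py` is importable from the working directory.
```
# Hedged sketch: build the easy-cluster command from the CLUSTER config above.
# Flags and paths mirror the command shown; cluster.py may do this differently.
import subprocess
from config import CLUSTER  # assumes config.py is importable from this directory

cmd = [
    CLUSTER.PATH_TO_MMSEQS, "easy-cluster",
    "clustering/input.fasta", "clustering/raw_output/mmseqs", "clustering/raw_output",
    "--min-seq-id", str(CLUSTER.MIN_SEQ_ID),
    "-c", str(CLUSTER.C),
    "--cov-mode", str(CLUSTER.COV_MODE),
    "--cluster-mode", str(CLUSTER.CLUSTER_MODE),
    "--dbtype", "1",
]
subprocess.run(cmd, check=True)
```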
The script will generate the following files:
```
benchmarking/
└── idr_prediction/
└── clustering/
├── input.fasta
├── mmseqs_full_results.csv
```
- **`clustering/input.fasta`**: the input file used by MMSeqs2 to cluster the IDR sequences. Headers are our assigned sequence IDs (found in the `IDs` column of `processed_data/all_albatross_seqs_and_properties.csv`).
- **`clustering/mmseqs_full_results.csv`**: clustering results. Columns:
- `representative seq_id`: the seq_id of the sequence representing this cluster
- `member seq_id`: the seq_id of a member of the cluster
- `representative seq`: the amino acid sequence of the cluster representative (representative seq_id)
- `member seq`: the amino acid sequence of the cluster member
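For reference, here is roughly how those columns could be assembled from MMSeqs2's pairwise cluster TSV (`<prefix>_cluster.tsv`, one representative/member pair per row) plus `input.fasta`. This is a hedged sketch that assumes Biopython is installed; it is not necessarily the exact logic in `cluster.py`.
```
# Hedged sketch: assemble mmseqs_full_results.csv-style columns from MMSeqs2 output.
import pandas as pd
from Bio import SeqIO  # assumes biopython is available

# Map sequence IDs (fasta headers) to sequences.
id_to_seq = {rec.id: str(rec.seq) for rec in SeqIO.parse("clustering/input.fasta", "fasta")}

# easy-cluster writes <prefix>_cluster.tsv with representative and member IDs per row.
pairs = pd.read_csv(
    "clustering/raw_output/mmseqs_cluster.tsv",
    sep="\t", names=["representative seq_id", "member seq_id"],
)
pairs["representative seq"] = pairs["representative seq_id"].map(id_to_seq)
pairs["member seq"] = pairs["member seq_id"].map(id_to_seq)
pairs.to_csv("clustering/mmseqs_full_results.csv", index=False)
```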
### Splitting
Cluster-based splitting is performed by `split.py`. Results are formatted as follows:
```
benchmarking/
└── idr_prediction/
└── splits/
└── asph/
├── test_df.csv
├── val_df.csv
├── train_df.csv
└── scaled_re/... # same format as splits/asph
└── scaled_rg/... # same format as splits/asph
└── scaling_exp/... # same format as splits/asph
├── test_cluster_split.csv
├── train_cluster_split.csv
├── val_cluster_split.csv
```
- **`_cluster_split.csv`**: cluster information for the clusters in each split (train, val, test). Columns = "representative seq_id", "member seq_id", "representative seq", "member seq", "member length"
- 📁 **`asph/`**, **`scaled_re/`**, **`scaled_rg/`**, and **`scaling_exp/`** contain the train, val, and test sets for each property (`train_df.csv`, `val_df.csv`, and `test_df.csv`). The splits follow `_cluster_split.csv`, but not every sequence has a measurement for every property, so each property's split files contain only the sequences with that measurement. Despite these omissions, the train-val-test ratio remains 80-10-10 for each property.
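Conceptually, the cluster-based split assigns whole clusters (not individual sequences) to train, validation, or test using the `SPLIT` parameters from `config.py`. The sketch below illustrates that idea; it is not the exact `split.py` implementation and assumes `config.py` is importable.
```
# Hedged sketch: cluster-level splitting so that no cluster spans two splits.
import pandas as pd
from sklearn.model_selection import train_test_split
from config import SPLIT  # assumes config.py is importable from this directory

clusters = pd.read_csv(SPLIT.CLUSTER_OUTPUT_PATH)
cluster_ids = clusters["representative seq_id"].unique()

# First split: train clusters vs. held-out ("other") clusters.
train_ids, other_ids = train_test_split(
    cluster_ids, test_size=SPLIT.TEST_SIZE_1, random_state=SPLIT.RANDOM_STATE_1
)
# Second split: held-out clusters -> validation and test.
val_ids, test_ids = train_test_split(
    other_ids, test_size=SPLIT.TEST_SIZE_2, random_state=SPLIT.RANDOM_STATE_2
)
```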
### Training
The model is defined in `model.py` and `utils.py`. The `train.py` script trains FusOn-pLM-IDR and ESM-2-650M-IDR models *separately for each property* (asphericity, Re, Rg, scaling exponent) with a hyperparameter screen, saves all results separated by property, and makes plots. `plot.py` can be used to regenerate the R2 plots.
- All **results** are stored in `idr_prediction/results/<timestamp>/`, where `timestamp` is a unique string encoding the date and time when you started training.
- All **raw outputs from models** are stored under `idr_prediction/trained_models/`, where the `embedding_path` subfolder (e.g. `fuson_plm/best/`) identifies the embeddings used to build the IDR property predictor.
- All **embeddings** made for training will be stored in a new folder called `idr_prediction/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
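For orientation, the predictor is a small regression network on top of protein language model embeddings. The sketch below shows one generic way such a head could be written; it is an illustration only, the real architecture lives in `model.py`, and the embedding dimension of 1280 (matching ESM-2-650M) is an assumption here.
```
# Illustrative only: a simple regression head over mean-pooled residue embeddings.
# The actual FusOn-pLM-IDR architecture is defined in model.py.
import torch
import torch.nn as nn

class SimpleIDRRegressor(nn.Module):
    def __init__(self, embedding_dim: int = 1280, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one scalar property per sequence
        )

    def forward(self, residue_embeddings: torch.Tensor) -> torch.Tensor:
        # residue_embeddings: (batch, seq_len, embedding_dim)
        pooled = residue_embeddings.mean(dim=1)  # mean-pool over residues
        return self.mlp(pooled).squeeze(-1)
```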
Below are the structures of the FusOn-pLM-IDR raw outputs folder, `trained_models/fuson_plm/best/`, and the results from the paper, `results/final/`:
```
benchmarking/
└── idr_prediction/
└── results/final/
└── r2_plots
└── asph/
├── esm2_t33_650M_UR50D_asph_R2.png
├── esm2_t33_650M_UR50D_asph_R2_source_data.csv
├── fuson_plm_asph_R2.png
├── fuson_plm_asph_R2_source_data.csv
└── scaled_re/ # same format as r2_plots/asph/...
└── scaled_rg/ # same format as r2_plots/asph/...
└── scaling_exp/ # same format as r2_plots/asph/...
├── asph_best_test_r2.csv
├── asph_hyperparam_screen_test_r2.csv
├── scaled_re_best_test_r2.csv
├── scaled_re_hyperparam_screen_test_r2.csv
├── scaled_rg_best_test_r2.csv
├── scaled_rg_hyperparam_screen_test_r2.csv
├── scaling_exp_best_test_r2.csv
├── scaling_exp_hyperparam_screen_test_r2.csv
└── trained_models/
└── asph/
└── fuson_plm/best/
└── lr0.0001_bs32/
├── asph_r2.csv
├── train_val_losses.csv
├── test_loss.csv
├── asph_test_predictions.csv
└── ... other hyperparameter folders with same format as lr0.0001_bs32/
└── esm2_t33_650M_UR50D # same format as asph/fuson_plm/best/
└── scaled_re/ # same format as asph/
└── scaled_rg/ # same format as asph/
└── scaling_exp/ # same format as asph/
```
In both directories, results are organized by IDR property and by the type of embedding used to train FusOn-pLM-IDR.
In the 📁 `results/final` directory:
- 📁 **`r2_plots/`**: holds all R2 plots and source data (the formatted data used to make the R2 plots), organized into one subfolder per property.
- **`_best_test_r2.csv`**: holds the R2 values for the top-performing models of each embedding type (e.g. ESM-2-650M and a specific checkpoint of FusOn-pLM)
- **`_hyperparam_screen_test_r2.csv`**: holds the R2 values for all embedding types, across all screened hyperparameters
In the 📁 `trained_models` directory:
- 📁 `<property>/` (i.e. `asph/`, `scaled_re/`, `scaled_rg/`, `scaling_exp/`): holds all results for all trained models predicting this property
- 📁 `asph/fuson_plm/best/`: holds all FusOn-pLM-IDR results on asphericity prediction for each set of hyperparameters screened when embeddings are made from "fuson_plm/best" (the FusOn-pLM model). For example, 📁 `lr0.0001_bs32/` holds results for a learning rate of 0.0001 and batch size of 32. If you were to retrain your own checkpoint of fuson_plm and run the IDR prediction benchmark, its results would be stored in a new subfolder of `trained_models/fuson_plm`.
- **`asph/fuson_plm/best/lr0.0001_bs32/asph_r2.csv`**: R2 value for this set of hyperparameters with "fuson_plm/best" embeddings
- **`asph/fuson_plm/best/lr0.0001_bs32/asph_test_predictions.csv`**: true asphericity values of the test set proteins, alongside FusOn-pLM-IDR's predictions of them.
- **`asph/fuson_plm/best/lr0.0001_bs32/test_loss.csv`**: FusOn-pLM-IDR's asphericity test loss value
- **`asph/fuson_plm/best/lr0.0001_bs32/train_val_losses.csv`**: FusOn-pLM-IDR's training and validation loss over each epoch while training on asphericity data
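To inspect a trained model's test performance yourself, you can recompute R2 from the predictions CSV. The column names used below (`true`, `pred`) are placeholders, so check the file's header first.
```
# Hedged sketch: recompute R2 from a predictions file.
import pandas as pd
from sklearn.metrics import r2_score

preds = pd.read_csv("trained_models/asph/fuson_plm/best/lr0.0001_bs32/asph_test_predictions.csv")
print(preds.columns.tolist())                # check the actual column names first
r2 = r2_score(preds["true"], preds["pred"])  # replace with the real true/predicted columns
print(f"Asphericity test R2: {r2:.3f}")
```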
To run the training script, enter:
```
nohup python train.py > train.out 2> train.err &
```
To run the plotting script, enter:
```
python plot.py
```