svincoff committed on Jan 18

Commit

8d9d9da

1 Parent(s): 3527fb2

puncta benchmark

Files changed (19)

fuson_plm/benchmarking/puncta/FOdb_physicochemical_embeddings.pkl +3 -0
fuson_plm/benchmarking/puncta/README.md +95 -0
fuson_plm/benchmarking/puncta/__init__.py +0 -0
fuson_plm/benchmarking/puncta/clean.py +184 -0
fuson_plm/benchmarking/puncta/cleaned_dataset_s4.csv +3 -0
fuson_plm/benchmarking/puncta/cleaning_log.txt +3 -0
fuson_plm/benchmarking/puncta/config.py +17 -0
fuson_plm/benchmarking/puncta/plot.py +244 -0
fuson_plm/benchmarking/puncta/results/final/cytoplasm_verificationFOs_results.csv +3 -0
fuson_plm/benchmarking/puncta/results/final/figures/cytoplasm_verificationFOs_barchart.png +0 -0
fuson_plm/benchmarking/puncta/results/final/figures/cytoplasm_verificationFOs_barchart_source_data.csv +3 -0
fuson_plm/benchmarking/puncta/results/final/figures/formation_verificationFOs_0.83thresh_barchart.png +0 -0
fuson_plm/benchmarking/puncta/results/final/figures/formation_verificationFOs_0.83thresh_barchart_source_data.csv +3 -0
fuson_plm/benchmarking/puncta/results/final/figures/nucleus_verificationFOs_barchart.png +0 -0
fuson_plm/benchmarking/puncta/results/final/figures/nucleus_verificationFOs_barchart_source_data.csv +3 -0
fuson_plm/benchmarking/puncta/results/final/formation_verificationFOs_0.83thresh_results.csv +3 -0
fuson_plm/benchmarking/puncta/results/final/nucleus_verificationFOs_results.csv +3 -0
fuson_plm/benchmarking/puncta/splits.csv +3 -0
fuson_plm/benchmarking/puncta/train.py +155 -0

fuson_plm/benchmarking/puncta/FOdb_physicochemical_embeddings.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3d78986f0724138ed83c72fd4274154ca3f96a09f5fd8ad94030493375788006
+size 168405

fuson_plm/benchmarking/puncta/README.md ADDED

	@@ -0,0 +1,95 @@

+## Puncta Prediction Benchmark
+This folder contains all the data and code needed to perform the **puncta prediction benchmark** (Figure 3).
+### From raw data to train/test splits
+To train the puncta predictors, we processed raw data from FOdb [(Tripathi et al. 2023)](https://doi.org/10.1038/s41467-023-41655-2) Supplementary dataset 4 (`fuson_plm/data/raw_data/FOdb_puncta.csv`) and Supplementary dataset 5 (`fuson_plm/data/raw_data/FOdb_SD5.csv`) using `clean.py` in the `puncta` directory.
+```
+data/
+└── raw_data/
+    ├── FOdb_puncta.csv
+    ├── FOdb_SD5.csv
+benchmarking/
+└── puncta/
+    ├── clean.py
+    ├── cleaned_dataset_s4.csv
+    ├── splits.csv
+    ├── FOdb_physicochemical_embeddings.pkl
+```
+The `clean.py` script generates the following files:
+- **`cleaned_dataset_s4.csv`**: cleaned version of `FOdb_puncta.csv`, where fusion oncoproteins with puncta status "Other" or "Nucleolar" have been removed, and only the 25 low-MI features from `FOdb_SD5.csv` are retained.
+- **`splits.csv`**: fusion oncoproteins from `cleaned_dataset_s4.csv`, labeled in the `split` column as either being part of the *train* set ("Expressed_Set" in FOdb) or *test* set ("Verification_Set" in FOdb). This dataset also features `nucleus`, `cytoplasm`, and `formation` columns of 1s and 0s. In `nucleus`, 1=forms a condensate in the nucleus, 0=does not; in `cytoplasm`, 1=forms a condensate in the cytoplasm, 0=does not; in `formation`, 1=forms a condensate at all, 0=does not.
+- **`FOdb_physicochemical_embeddings.pkl`**: a dictionary where the amino acid sequences of the fusion proteins in `splits.csv` are the keys, and their feature vectors of 25 low-MI features from `cleaned_dataset_s4.csv` are the values.
+### Training
+`config.py` holds training configurations.
+```
+# Benchmarking configs
+BENCHMARK_FUSONPLM = True                           # True if you want to benchmark a FusOn-pLM Model
+# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
+# If you want to use the trained FusOn-pLM instead, set FUSONPLM_CKPTS="FusOn-pLM"
+FUSONPLM_CKPTS= {}
+# Model comparison configs
+BENCHMARK_ESM = True                                # True if you want to benchmark ESM-2-650M
+BENCHMARK_PROTT5 = True                             # True if you want to benchmark ProtT5
+BENCHMARK_FO_PUNCTA_ML = True                       # True if you want to benchmark FO-Puncta-ML from the FOdb paper
+# Overwriting configs
+PERMISSION_TO_OVERWRITE = False                     # if False, script will halt if it believes these embeddings have already been made.
+# GPU configs
+CUDA_VISIBLE_DEVICES="0"                            # GPUs to make visible for this process
+```
+<br>
+`train.py` will train the XGBoost classifiers.
+- All **results** are stored in `puncta/results/timestamp`, where `timestamp` is a unique string encoding the date and time when you started training.
+- All **embeddings** made for training will be stored in a new folder called `puncta/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
+```
+benchmarking/
+└── puncta/
+    └── embeddings/
+        └── esm2_t33_650M_UR50D/...
+        └── fuson_plm/...
+        └── prot_t5_xl_half_uniref50_enc/...
+    └── results/
+        └── final/
+            └── figures/
+                ├── cytoplasm_verificationFOs_barchart_source_data.csv
+                ├── cytoplasm_verificationFOs_barchart.png
+                ├── formation_verificationFOs_0.83thresh_barchart_source_data.csv
+                ├── formation_verificationFOs_0.83thresh_barchart.png
+                ├── nucleus_verificationFOs_barchart_source_data.csv
+                ├── nucleus_verificationFOs_barchart.png
+            ├── cytoplasm_verificationFOs_results.csv
+            ├── formation_verificationFOs_0.83thresh_results.csv
+            ├── nucleus_verificationFOs_results.csv
+```
+The following files are in `results/final/figures`:
+- **`cytoplasm_verificationFOs_barchart.png`**: bar chart of performance on the cytoplasm puncta prediction task (Fig. 3E), and the formatted data that went directly into the plot (`cytoplasm_verificationFOs_barchart_source_data.csv`)
+- **`formation_verificationFOs_0.83thresh_barchart.png`**: bar chart of performance on the puncta formation prediction task (Fig. 3C), and the formatted data that went directly into the plot (`formation_verificationFOs_0.83thresh_barchart_source_data.csv`)
+- **`nucleus_verificationFOs_barchart.png`**: bar chart of performance on the nucleus puncta prediction task (Fig. 3D), and the formatted data that went directly into the plot (`nucleus_verificationFOs_barchart_source_data.csv`)
+The raw data are included in `results/final` as `cytoplasm_verificationFOs_results.csv`, `formation_verificationFOs_0.83thresh_results.csv`, and `nucleus_verificationFOs_results.csv`.
+If you train a new model, the equivalents of these files will be created in `results/timestamp` for your specific configurations set in `config.py`.
+To run training, enter in terminal:
+```
+python train.py
+```
+To regenerate plots, run:
+```
+python plot.py
+```
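
A minimal sketch of how `splits.csv` and `FOdb_physicochemical_embeddings.pkl`, as described in this README, might be joined before training. It assumes the pickle is keyed by amino acid sequence, which is what `make_embeddings` in `clean.py` suggests. Toy values stand in for the real files, and `attach_embeddings` is a hypothetical helper that is not part of the repository.

```python
import numpy as np
import pandas as pd


def attach_embeddings(splits: pd.DataFrame, embeddings: dict) -> pd.DataFrame:
    """Hypothetical helper: add an 'embedding' column looked up by aa_seq."""
    missing = [s for s in splits["aa_seq"] if s not in embeddings]
    if missing:
        raise KeyError(f"{len(missing)} sequences have no embedding")
    out = splits.copy()
    out["embedding"] = out["aa_seq"].map(lambda s: embeddings[s])
    return out


# Toy stand-ins for splits.csv and the pickled dictionary (made-up values)
splits = pd.DataFrame({
    "aa_seq": ["MKT", "MAA", "MGG"],
    "split": ["train", "train", "test"],
    "formation": [1, 0, 1],
})
embeddings = {"MKT": [0.1, 0.2], "MAA": [0.3, 0.4], "MGG": [0.5, 0.6]}

with_emb = attach_embeddings(splits, embeddings)
train = with_emb.loc[with_emb["split"] == "train"].reset_index(drop=True)
X_train = np.array(train["embedding"].tolist())  # shape: (n_train, n_features)
y_train = np.array(train["formation"].tolist())
```

With the real files, `splits` would come from `pd.read_csv('splits.csv')` and `embeddings` from `pickle.load`, with 25 features per sequence instead of two.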

fuson_plm/benchmarking/puncta/__init__.py ADDED

File without changes

fuson_plm/benchmarking/puncta/clean.py ADDED

	@@ -0,0 +1,184 @@

+# Cleans raw data to prepare FO labels and embeddings
+from fuson_plm.utils.logging import open_logfile, log_update
+from fuson_plm.utils.data_cleaning import find_invalid_chars
+from fuson_plm.utils.constants import VALID_AAS
+import pandas as pd
+import numpy as np
+import pickle
+def find_localization(row):
+    puncta_status = row['Puncta_Status']
+    cytoplasm = (row['Cytoplasm']=='Punctate')
+    nucleus = (row['Nucleus']=='Punctate')
+    both = cytoplasm and nucleus
+    if puncta_status=='YES':
+        if both:
+            return 'Both'
+        else:
+            if cytoplasm:
+                return 'Cytoplasm'
+            if nucleus:
+                return 'Nucleus'
+    return np.nan
+def clean_s5(df):
+    log_update("Cleaning FOdb Supplementary Table 5")
+    # extract only the physicochemical features used by the FO-Puncta ML model
+    retained_features = df.loc[
+        df['Low MI Set: Used In ML Model'].isin(['Yes','Yet'])      # allow flexibility for typo in this DF
+    ]['Parameter Label (Sup Table 2 & Matlab Scripts)'].tolist()
+    retained_features = sorted(retained_features)
+    # log the result
+    log_update(f'\tIsolated the {len(retained_features)} low-MI features used to train ML model')
+    for i, feat in enumerate(retained_features): log_update(f'\t\t{i+1}. {feat}')
+    # return the result
+    return retained_features
+def make_label_df(df):
+    """
+    Input df should be cleaned s4
+    """
+    label_df = df[['FO_Name','AAseq','Localization','Puncta_Status','Dataset']].rename(columns={'FO_Name':'fusiongene','AAseq':'aa_seq','Dataset':'dataset'})
+    dataset_to_split_dict = {'Expressed_Set': 'train', 'Verification_Set': 'test'}
+    label_df['split'] = label_df['dataset'].apply(lambda x: dataset_to_split_dict[x])
+    label_df['nucleus'] = label_df['Localization'].apply(lambda x: 1 if x in ['Nucleus','Both'] else 0)
+    label_df['cytoplasm'] = label_df['Localization'].apply(lambda x: 1 if x in ['Cytoplasm','Both'] else 0)
+    label_df['formation'] = label_df['Puncta_Status'].apply(lambda x: 1 if x=='YES' else 0)
+    label_df = label_df[['fusiongene','aa_seq','dataset','split','nucleus','cytoplasm','formation']]
+    return label_df
+def make_embeddings(df, physicochemical_features):
+    feat_string = '\n\t' + '\n\t'.join([str(i)+'. '+feat for i,feat in enumerate(physicochemical_features)])
+    log_update(f"\nMaking physicochemical feature vectors.\nFeature Order: {feat_string}")
+    embeddings = {}
+    aa_seqs = df['AAseq'].unique()
+    for seq in aa_seqs:
+        feats = df.loc[df['AAseq']==seq].reset_index(drop=True)[physicochemical_features].T[0].tolist()
+        embeddings[seq] = feats
+    return embeddings
+def clean_s4(df, retained_features):
+    log_update("Cleaning FOdb Supplementary Table 4")
+    df = df.loc[
+        df['Puncta_Status'].isin(['YES','NO'])
+    ].reset_index(drop=True)
+    log_update(f'\tRemoved invalid FOs (puncta status = "Other" or "Nucleolar"). Remaining FOs: {len(df)}')
+    # check for duplicate sequences
+    dup_seqs = df.loc[df['AAseq'].duplicated()]['AAseq'].unique()
+    log_update(f"\tTotal duplicated sequences: {len(dup_seqs)}")
+    # check for invalid characters
+    df['invalid_chars'] = df['AAseq'].apply(lambda x: find_invalid_chars(x, VALID_AAS))
+    all_invalid_chars = set().union(*df['invalid_chars'])
+    log_update(f"\tChecking for invalid characters...\n\t\tFound {len(all_invalid_chars)} invalid characters")
+    for c in all_invalid_chars:
+        subset = df.loc[df['AAseq'].str.contains(c, regex=False)]['AAseq'].tolist()   # literal match, in case c is a regex metacharacter
+        for seq in subset:
+            log_update(f"\t\tInvalid char {c} at index {seq.index(c)}/{len(seq)-1} of sequence {seq}")
+    # going to just remove the "-" from the special sequence
+    df = df.drop(columns=['invalid_chars'])
+    df.loc[
+        df['AAseq'].str.contains('-'),'AAseq'
+    ] = df.loc[df['AAseq'].str.contains('-'),'AAseq'].item().replace('-','')
+    # change FO format to ::
+    df['FO_Name'] = df['FO_Name'].apply(lambda x: x.replace('_','::'))
+    log_update(f'\tChanged FO names to Head::Tail format')
+    # Isolate positive and negative sets
+    df['Localization'] = ['']*len(df)
+    df['Localization'] = df.apply(lambda row: find_localization(row), axis=1)
+    puncta_positive = df.loc[
+        df['Puncta_Status']=='YES'
+    ].reset_index(drop=True)
+    puncta_negative = df.loc[
+        df['Puncta_Status']=='NO'
+    ].reset_index(drop=True)
+    # Only keeping retained features
+    cols = list(df.columns)
+    mi_feats_included = set(retained_features).intersection(set(cols))
+    log_update(f"\tChecking for the {len(retained_features)} low-MI features... {len(mi_feats_included)} found")
+    # make sure all of these are no-na
+    for rf in retained_features:
+        # if there's NaN, log it. Make sure the only instances of np.nan are for Verification Set FOs.
+        if df[rf].isna().sum()>0:
+            nas = df.loc[df[rf].isna()]
+            log_update(f"\t\tFeature {rf} has {len(nas)} np.nan values in the following datasets:")
+            for k,v in nas['Dataset'].value_counts().items():
+                log_update(f'\t\t\t{k}: {v}')
+    df = df[['FO_Name', 'Nucleus', 'Nucleolus', 'Cytoplasm','Puncta_Status', 'Dataset', 'Localization', 'AAseq',
+             'Puncta.pred', 'Puncta.prob']+retained_features]
+    # Quantify localization
+    log_update(f'\n\tPuncta localization for {len(puncta_positive)} FOs where Puncta_Status==YES')
+    for k, v in puncta_positive['Localization'].value_counts().items():
+        pcnt = 100*v/sum(puncta_positive['Localization'].value_counts())
+        log_update(f'\t\t{k}: \t{v} ({pcnt:.2f}%)')
+    log_update("\tDataset breakdown...")
+    dataset_vc = df['Dataset'].value_counts()
+    expressed_puncta_statuses = df.loc[df['Dataset']=='Expressed_Set']['Puncta_Status'].value_counts()
+    expressed_positive_locs = puncta_positive.loc[puncta_positive['Dataset']=='Expressed_Set']['Localization'].value_counts()
+    verification_positive_locs = puncta_positive.loc[puncta_positive['Dataset']=='Verification_Set']['Localization'].value_counts()
+    verification_puncta_statuses = df.loc[df['Dataset']=='Verification_Set']['Puncta_Status'].value_counts()
+    for k, v in dataset_vc.items():
+        pcnt = 100*v/sum(dataset_vc)
+        log_update(f'\t\t{k}: \t{v} ({pcnt:.2f}%)')
+        if k=='Expressed_Set':
+            for key, val in expressed_puncta_statuses.items():
+                pcnt = 100*val/v
+                log_update(f'\t\t\t{key}: \t{val} ({pcnt:.2f}%)')
+                if key=='YES':
+                    log_update('\t\t\t\tLocalizations...')
+                    for key2, val2 in expressed_positive_locs.items():
+                        pcnt = 100*val2/val
+                        log_update(f'\t\t\t\t\t{key2}: \t{val2} ({pcnt:.2f}%)')
+        if k=='Verification_Set':
+            for key, val in verification_puncta_statuses.items():
+                pcnt = 100*val/v
+                log_update(f'\t\t\t{key}: \t{val} ({pcnt:.2f}%)')
+                if key=='YES':
+                    log_update('\t\t\t\tLocalizations...')
+                    for key2, val2 in verification_positive_locs.items():
+                        pcnt = 100*val2/val
+                        log_update(f'\t\t\t\t\t{key2}: \t{val2} ({pcnt:.2f}%)')
+    return df
+def main():
+    LOG_PATH = 'cleaning_log.txt'
+    FODB_S4_PATH = '../../data/raw_data/FOdb_puncta.csv'
+    FODB_S5_PATH = '../../data/raw_data/FOdb_SD5.csv'
+    with open_logfile(LOG_PATH):
+        s4 = pd.read_csv(FODB_S4_PATH)
+        s5 = pd.read_csv(FODB_S5_PATH)
+        retained_features = clean_s5(s5)
+        cleaned_s4 = clean_s4(s4, retained_features)
+        label_df = make_label_df(cleaned_s4)
+        embeddings = make_embeddings(cleaned_s4, retained_features)
+        # save the results
+        cleaned_s4.to_csv('cleaned_dataset_s4.csv', index=False)
+        log_update("\nSaved cleaned table S4 to cleaned_dataset_s4.csv")
+        label_df.to_csv('splits.csv', index=False)
+        log_update("\nSaved train-test splits with nucleus, cytoplasm, and formation labels to splits.csv")
+        with open('FOdb_physicochemical_embeddings.pkl','wb') as f:
+            pickle.dump(embeddings, f)
+        log_update("\nSaved physicochemical embeddings as a dictionary to FOdb_physicochemical_embeddings.pkl")
+if __name__ == '__main__':
+    main()
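
A minimal sketch of the label logic in `find_localization` and `make_label_df`, written as vectorized column operations. It is meant to show that the intermediate `Localization` string reduces to two boolean conditions. `derive_labels` is a hypothetical helper, the toy rows are made up, and `"Diffuse"` is only a placeholder for any non-punctate value.

```python
import pandas as pd


def derive_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: binary nucleus/cytoplasm/formation labels per FO."""
    forms = df["Puncta_Status"].eq("YES")
    out = pd.DataFrame({"fusiongene": df["FO_Name"]})
    # A compartment label is 1 only when the FO forms puncta at all
    # AND is annotated as punctate in that compartment.
    out["nucleus"] = (forms & df["Nucleus"].eq("Punctate")).astype(int)
    out["cytoplasm"] = (forms & df["Cytoplasm"].eq("Punctate")).astype(int)
    out["formation"] = forms.astype(int)
    return out


# Toy rows (not taken from FOdb)
toy = pd.DataFrame({
    "FO_Name": ["A::B", "C::D", "E::F", "G::H"],
    "Puncta_Status": ["YES", "YES", "NO", "YES"],
    "Nucleus": ["Punctate", "Diffuse", "Punctate", "Punctate"],
    "Cytoplasm": ["Punctate", "Punctate", "Diffuse", "Diffuse"],
})
labels = derive_labels(toy)
```

Note the third row: a `Puncta_Status` of `NO` yields zeros for both compartments even though its `Nucleus` column says `Punctate`, which matches the behavior of `find_localization` returning `np.nan` for such rows.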

fuson_plm/benchmarking/puncta/cleaned_dataset_s4.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8f9075866f3296746c83eac61caf5c871e3a6dd54a2986896c9fd71a5a11511c
+size 183523

fuson_plm/benchmarking/puncta/cleaning_log.txt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:22677b05ff483b17390edc30e53555097e85ce7ac6aaa3cd04aece67d3963bc1
+size 3356

fuson_plm/benchmarking/puncta/config.py ADDED

	@@ -0,0 +1,17 @@

+# Benchmarking configs
+BENCHMARK_FUSONPLM = True                           # True if you want to benchmark a FusOn-pLM Model
+# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
+# If you want to use the trained FusOn-pLM instead, set FUSONPLM_CKPTS="FusOn-pLM"
+FUSONPLM_CKPTS= "FusOn-pLM"
+# Model comparison configs
+BENCHMARK_ESM = True                                # True if you want to benchmark ESM-2-650M
+BENCHMARK_PROTT5 = True                             # True if you want to benchmark ProtT5
+BENCHMARK_FO_PUNCTA_ML = True                       # True if you want to benchmark FO-Puncta-ML from the FOdb paper
+# Overwriting configs
+PERMISSION_TO_OVERWRITE = False                     # if False, script will halt if it believes these embeddings have already been made.
+# GPU configs
+CUDA_VISIBLE_DEVICES="0"                            # GPUs to make visible for this process
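
A minimal sketch of how the two accepted shapes of `FUSONPLM_CKPTS` could be normalized into a flat list of checkpoints. The real handling lives in `embed_dataset_for_benchmark`, which is not shown in this commit, so everything here is an assumption: `expand_checkpoints` is a hypothetical helper, and the dictionary values are assumed to be lists of epoch numbers.

```python
def expand_checkpoints(ckpts):
    """Hypothetical helper: return (model_name, epoch) pairs for either config style."""
    if isinstance(ckpts, str):
        # e.g. "FusOn-pLM": the released model, no epoch to select
        return [(ckpts, None)]
    if isinstance(ckpts, dict):
        # e.g. {"my_run": [10, 20]}: one entry per (run, epoch)
        return [(run, epoch) for run, epochs in ckpts.items() for epoch in epochs]
    raise TypeError("FUSONPLM_CKPTS should be a string or a dictionary")


released = expand_checkpoints("FusOn-pLM")
custom = expand_checkpoints({"my_run": [10, 20]})
```

An empty dictionary, as shown in the README excerpt of this config, would expand to no checkpoints under this reading.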

fuson_plm/benchmarking/puncta/plot.py ADDED

	@@ -0,0 +1,244 @@

+import matplotlib.pyplot as plt
+import matplotlib.patches as mpatches
+import seaborn as sns
+import pandas as pd
+import numpy as np
+import os
+import matplotlib.colors as mcolors
+from fuson_plm.utils.visualizing import set_font
+fo_puncta_db_training_thresh31 = pd.DataFrame(data={
+        'Model Type': ['fo_puncta_ml'],
+        'Model Name': ['fo_puncta_ml_literature'],
+        'Model Epoch': np.nan,
+        'Accuracy': 0.81,
+        'Precision': 0.78,
+        'Recall': 0.98,
+        'F1 Score': 0.87,
+        'AUROC': 0.88,
+        'AUPRC': 0.94
+})
+fo_puncta_db_verification_thresh83 = pd.DataFrame(data={
+        'Model Type': ['fo_puncta_ml'],
+        'Model Name': ['fo_puncta_ml_literature'],
+        'Model Epoch': np.nan,
+        'Accuracy': 0.79,
+        'Precision': 0.81,
+        'Recall': 0.89,
+        'F1 Score': 0.85,
+        'AUROC': 0.73,
+        'AUPRC': 0.82
+})
+# Method for lengthening the model name
+def lengthen_model_name(row):
+    name = row['Model Name']
+    epoch = row['Model Epoch']
+    if 'esm' in name:
+        return name
+    if 'puncta' in name:
+        return name
+    return f'{name}_e{epoch}'
+# Method for shortening the model name for display
+def shorten_model_name(row):
+    name = row['Model Name']
+    epoch = row['Model Epoch']
+    if 'esm' in name:
+        return 'ESM-2-650M'
+    if name=='fo_puncta_ml':
+        return 'FO-Puncta-ML'
+    if name=='fo_puncta_ml_literature':
+        return 'FO-Puncta-ML Lit'
+    if name=="prot_t5_xl_half_uniref50_enc":
+        return 'ProtT5-XL-U50'  # this is what they call it in the paper
+    if 'snp_' in name:
+        prob_type = 'snp'
+    elif 'uniform_' in name:
+        prob_type = 'uni'
+    layers = name.split('layers')[0].split('_')[-1]
+    dt = name.split('mask')[1].split('-', 1)[1]
+    return f'{prob_type}_{layers}L_{dt}_e{epoch}'
+def make_final_bar(dataframe, title, save_path):
+    set_font()
+    df = dataframe.copy(deep=True)
+    # Pivot the DataFrame to have metrics as rows and names as columns, and reorder columns
+    pivot_df = df.pivot(index='Metric', columns='Name', values='Value')
+    ordered_columns = [x for x in ['FOdb','ProtT5-XL-U50', 'ESM-2-650M', 'FusOn-pLM'] if x in pivot_df.columns]
+    pivot_df = pivot_df[ordered_columns]
+    # Define the groups
+    engineered_embeddings = ['FOdb']
+    deep_learning_embeddings = ['ProtT5-XL-U50', 'ESM-2-650M', 'FusOn-pLM']
+    # Reorder the metrics
+    metric_order = ['Accuracy', 'Precision', 'Recall', 'F1', 'AUROC'][::-1]
+    pivot_df = pivot_df.reindex(metric_order)
+    # Plotting
+    fig, ax = plt.subplots(figsize=(8, 6), dpi=300)  # Increased figure size for better legend placement
+    # Define bar width and positions
+    bar_width = 0.2
+    indices = np.arange(len(pivot_df))
+    # Use a colorblind-friendly color scheme from tableau
+    color_map = {
+        #'One-Hot': "#999999",
+        'FOdb': "#E69F00",
+        'ESM-2-650M': "#F0E442",
+        'FusOn-pLM': "#FF69B4",
+        'ProtT5-XL-U50': "#00ccff" # light blue
+    }
+    colors = [color_map[col] for col in ordered_columns]
+    # Plot bars for each category and add them to appropriate legend groups
+    engineered_handles = []
+    deep_learning_handles = []
+    for i, (name, color) in enumerate(zip(pivot_df.columns, colors)):
+        bars = ax.barh(indices + i * bar_width, pivot_df[name], bar_width, label=name, color=color)
+        if name in engineered_embeddings:
+            engineered_handles.append(bars[0])
+        else:
+            deep_learning_handles.append(bars[0])
+    # Add bold black asterisks next to the winning bars for each category (could be multiple)
+    #for j, metric in enumerate(pivot_df.index):
+    #    max_value = pivot_df.loc[metric].max()
+    #    max_indices = pivot_df.loc[metric][pivot_df.loc[metric] == max_value].index
+    #    for max_name in max_indices:
+    #        max_index = list(pivot_df.columns).index(max_name)
+    #        ax.text(max_value + 0.01, j + max_index * bar_width - bar_width / 4, '*',
+    #                color='black', fontsize=12, fontweight='bold', ha='center', va='center')
+    # Set labels, ticks, and title
+    plt.xlabel('Value', fontsize=44)  # Adjusted font size
+    ax.set_yticks(indices + bar_width * 1.5)
+    ax.set_xlim([0, 1])
+    ax.set_yticklabels(pivot_df.index)
+    # make the xticklabels size 24
+    ax.tick_params(axis='x')
+    ax.set_title(title, fontsize=44)  # Adjusted font size
+    # Setting font size for tick labels
+    for label in plt.gca().get_xticklabels():
+        label.set_fontsize(32)  # Adjusted font size
+    for label in plt.gca().get_yticklabels():
+        label.set_fontsize(32)  # Adjusted font size
+    # Create two separate legends
+    if engineered_handles:
+        legend1 = fig.legend(
+            engineered_handles[::-1],
+            [emb for emb in engineered_embeddings if emb in ordered_columns][::-1],
+            loc='center left',
+            bbox_to_anchor=(1, 0.4),
+            title="Engineered Embeddings",
+            title_fontsize=24)  # Adjusted font size
+    if deep_learning_handles:
+        legend2 = fig.legend(
+            deep_learning_handles[::-1],
+            [emb for emb in deep_learning_embeddings if emb in ordered_columns][::-1],
+            loc='center left',
+            bbox_to_anchor=(1, 0.6),
+            title="Learned Embeddings",
+            title_fontsize=24)  # Adjusted font size
+    # Adjust legend text size
+    if engineered_handles:
+        ax.add_artist(legend1)
+        for text in legend1.get_texts():
+            text.set_fontsize(22)  # Adjusted font size
+        for handle in legend1.legendHandles:
+            if isinstance(handle, mpatches.Patch):
+                handle.set_height(15)  # Adjust height
+                handle.set_width(20)   # Adjust width
+            elif hasattr(handle, '_sizes'):
+                handle._sizes = [200]  # Increase marker size in the legend
+    if deep_learning_handles:
+        ax.add_artist(legend2)
+        for text in legend2.get_texts():
+            text.set_fontsize(22)  # Adjusted font size
+        for handle in legend2.legendHandles:
+            if isinstance(handle, mpatches.Patch):
+                handle.set_height(15)  # Adjust height
+                handle.set_width(20)   # Adjust width
+            elif hasattr(handle, '_sizes'):
+                handle._sizes = [200]  # Increase marker size in the legend
+    plt.tight_layout()  # Adjust layout to make room for the legends
+    # Save the plot to a file
+    plt.savefig(save_path, dpi=300, bbox_inches='tight')
+    plt.show()
+def prepare_data_for_bar(results_dir, task, split, thresh=None):
+    fname = f"{task}_{split}FOs_results.csv"
+    if thresh is not None: fname = f"{task}_{split}FOs_{thresh}thresh_results.csv"
+    image_save_path = results_dir + '/figures/' + fname.split('_results.csv')[0]+'_barchart.png'
+    data = pd.read_csv(f"{results_dir}/{fname}")
+    data = data.loc[
+        data['Model Name'].isin(['best',
+                          'fo_puncta_ml',
+                          'esm2_t33_650M_UR50D',
+                          'prot_t5_xl_half_uniref50_enc'])
+    ]
+    data = pd.DataFrame(data = {
+        'Name': data['Model Name'].tolist() * 5,
+        'Metric': ['Accuracy', 'Accuracy', 'Accuracy','Accuracy',
+               'Precision', 'Precision', 'Precision', 'Precision',
+               'Recall', 'Recall', 'Recall', 'Recall',
+               'F1', 'F1', 'F1','F1',
+               'AUROC', 'AUROC', 'AUROC','AUROC'],
+        'Value': data['Accuracy'].tolist() + data['Precision'].tolist() + data['Recall'].tolist() + data['F1 Score'].tolist() + data['AUROC'].tolist()
+    }
+    )
+    rename_dict = {'fo_puncta_ml': 'FOdb',
+                   'esm2_t33_650M_UR50D':'ESM-2-650M',
+                   'best':'FusOn-pLM',
+                   'prot_t5_xl_half_uniref50_enc': 'ProtT5-XL-U50'}
+    data['Name'] = data['Name'].map(rename_dict)
+    return data, image_save_path
+def make_all_final_bar_charts(results_dir):
+    # Puncta verification
+    data, image_save_path = prepare_data_for_bar(results_dir,"formation","verification",thresh=0.83)
+    data_cp = data.copy(deep=True)
+    data_cp["Value"] = data_cp["Value"].round(3)
+    data_cp.to_csv(image_save_path.replace(".png","_source_data.csv"),index=False)
+    make_final_bar(data, "Puncta Propensity", image_save_path)
+    # Nucleus verification
+    data, image_save_path = prepare_data_for_bar(results_dir,"nucleus","verification",thresh=None)
+    data_cp = data.copy(deep=True)
+    data_cp["Value"] = data_cp["Value"].round(3)
+    data_cp.to_csv(image_save_path.replace(".png","_source_data.csv"),index=False)
+    make_final_bar(data, "Nucleus Localization", image_save_path)
+    # Cytoplasm verification
+    data, image_save_path = prepare_data_for_bar(results_dir,"cytoplasm","verification",thresh=None)
+    data_cp = data.copy(deep=True)
+    data_cp["Value"] = data_cp["Value"].round(3)
+    data_cp.to_csv(image_save_path.replace(".png","_source_data.csv"),index=False)
+    make_final_bar(data, "Cytoplasm Localization", image_save_path)
+def main():
+    # Read in the input data
+    results_dir="results/final"
+    os.makedirs(f"{results_dir}/figures",exist_ok=True)
+    make_all_final_bar_charts(results_dir)
+if __name__ == '__main__':
+    main()
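
`prepare_data_for_bar` builds its long-format table by hand and hard-codes four models per metric, so it would misalign if a model were missing from the results file. A minimal sketch of one alternative, using `DataFrame.melt`, is shown here. `to_long_format` is a hypothetical helper, and the numbers are toy values rather than benchmark results.

```python
import pandas as pd

# Metric columns as they are named in the results CSVs read by plot.py
METRICS = ["Accuracy", "Precision", "Recall", "F1 Score", "AUROC"]


def to_long_format(results: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per (model, metric), for any number of models."""
    long_df = results.melt(
        id_vars="Model Name",
        value_vars=METRICS,
        var_name="Metric",
        value_name="Value",
    ).rename(columns={"Model Name": "Name"})
    # The bar chart labels the F1 column simply as "F1"
    long_df["Metric"] = long_df["Metric"].replace({"F1 Score": "F1"})
    return long_df


# Toy numbers, not real benchmark results
toy_results = pd.DataFrame({
    "Model Name": ["esm2_t33_650M_UR50D", "fo_puncta_ml"],
    "Accuracy": [0.5, 0.6],
    "Precision": [0.5, 0.6],
    "Recall": [0.5, 0.6],
    "F1 Score": [0.5, 0.7],
    "AUROC": [0.5, 0.6],
})
long_df = to_long_format(toy_results)
```

The result has the same `Name`, `Metric`, and `Value` columns that `make_final_bar` pivots on, so the display-name mapping in `rename_dict` could be applied to it unchanged.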

fuson_plm/benchmarking/puncta/results/final/cytoplasm_verificationFOs_results.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:800f935f72b089b357fb4b0ac22a4c75a09a4578e44fac2c20a297c60c76df76
+size 871

fuson_plm/benchmarking/puncta/results/final/figures/cytoplasm_verificationFOs_barchart.png ADDED

fuson_plm/benchmarking/puncta/results/final/figures/cytoplasm_verificationFOs_barchart_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:06aa241a68bff40ae38cd6d484c4ff3ebf4d8613fb0e671576a3f07b6977dbda
+size 470

fuson_plm/benchmarking/puncta/results/final/figures/formation_verificationFOs_0.83thresh_barchart.png ADDED

fuson_plm/benchmarking/puncta/results/final/figures/formation_verificationFOs_0.83thresh_barchart_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:85ad0497edcb0438fafe20d2807afb694114bdc3a73401ca0ed6b739baca1603
+size 472

fuson_plm/benchmarking/puncta/results/final/figures/nucleus_verificationFOs_barchart.png ADDED

fuson_plm/benchmarking/puncta/results/final/figures/nucleus_verificationFOs_barchart_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:73697291e6f1d8036fd089babbde87e39c30d040e98a2c20d71dfb202925e316
+size 472

fuson_plm/benchmarking/puncta/results/final/formation_verificationFOs_0.83thresh_results.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:72c68d45ca772a2bded7473803767c12dbafa4bac09bc10aed70a075c386682c
+size 888

fuson_plm/benchmarking/puncta/results/final/nucleus_verificationFOs_results.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:37d6fc7ec393c48756c286e01ddb942b8b98b03564f22a099d01e2bd537f33ca
+size 887

fuson_plm/benchmarking/puncta/splits.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:44f627efa4f76a35b2a4a83be77ae8815c7b728e3ca2ca5127d8177789127f7e
+size 133807

fuson_plm/benchmarking/puncta/train.py ADDED

	@@ -0,0 +1,155 @@

+import torch
+import time
+import pandas as pd
+import numpy as np
+import pickle
+import os
+from fuson_plm.benchmarking.xgboost_predictor import train_final_predictor, evaluate_predictor, train_predictor_xval
+from fuson_plm.benchmarking.embed import embed_dataset_for_benchmark
+import fuson_plm.benchmarking.puncta.config as config
+from fuson_plm.benchmarking.puncta.plot import make_all_final_bar_charts
+from fuson_plm.utils.logging import log_update, open_logfile, print_configpy, get_local_time, CustomParams
+def check_splits(df):
+    # make sure everything has a split
+    if len(df.loc[df['split'].isna()])>0:
+        raise Exception("Error: not every benchmarking sequence has been allocated to a split (train or test)")
+    # make sure the split column contains exactly 'train' and 'test' (both present, nothing else)
+    if set(df['split'].unique()) != {'train','test'}:
+        raise Exception("Error: splits column should only have 'train' and 'test'.")
+    # make sure there are no duplicate sequences
+    if len(df.loc[df['aa_seq'].duplicated()])>0:
+        raise Exception("Error: duplicate sequences provided")
+def train_and_evaluate_puncta_predictor(details, splits_with_embeddings,outdir,task='nucleus',class1_thresh=0.5,n_estimators=50,tree_method="hist"):
+    """
+    task = 'nucleus', 'cytoplasm', or 'formation'
+    """
+    # unpack the details dictionary
+    benchmark_model_type = details['model_type']
+    benchmark_model_name = details['model']
+    benchmark_model_epoch = details['epoch']
+    # prepare train and test sets for model
+    train_split = splits_with_embeddings.loc[splits_with_embeddings['split']=='train'].reset_index(drop=True)
+    test_split = splits_with_embeddings.loc[splits_with_embeddings['split']=='test'].reset_index(drop=True)
+    X_train = np.array(train_split['embedding'].tolist())
+    y_train = np.array(train_split[task].tolist())
+    X_test = np.array(test_split['embedding'].tolist())
+    y_test = np.array(test_split[task].tolist())
+    # Train the final model on the full training split
+    clf = train_final_predictor(X_train, y_train, n_estimators=n_estimators, tree_method=tree_method)
+    # Evaluate it
+    automatic_stats_df, custom_stats_df = evaluate_predictor(clf, X_test, y_test, class1_thresh=class1_thresh)
+    # Add the model details back in
+    cols = list(automatic_stats_df.columns)
+    automatic_stats_df['Model Type'] = [benchmark_model_type]
+    automatic_stats_df['Model Name'] = [benchmark_model_name]
+    automatic_stats_df['Model Epoch'] = [benchmark_model_epoch]
+    newcols = ['Model Type','Model Name','Model Epoch'] + cols
+    automatic_stats_df = automatic_stats_df[newcols]
+    cols = list(custom_stats_df.columns)
+    custom_stats_df['Model Type'] = [benchmark_model_type]
+    custom_stats_df['Model Name'] = [benchmark_model_name]
+    custom_stats_df['Model Epoch'] = [benchmark_model_epoch]
+    newcols = ['Model Type','Model Name','Model Epoch'] + cols
+    custom_stats_df = custom_stats_df[newcols]
+    # Save automatic results (for nucleus and cytoplasm)
+    if task!="formation":
+        automatic_stats_path = f'{outdir}/{task}_verificationFOs_results.csv'
+        if not(os.path.exists(automatic_stats_path)):
+            automatic_stats_df.to_csv(automatic_stats_path,index=False)
+        else:
+            automatic_stats_df.to_csv(automatic_stats_path,mode='a',index=False,header=False)
+    # Save custom threshold results (only if it's formation)
+    if task=="formation":
+        custom_stats_path = f'{outdir}/{task}_verificationFOs_{class1_thresh}thresh_results.csv'
+        if not(os.path.exists(custom_stats_path)):
+            custom_stats_df.to_csv(custom_stats_path,index=False)
+        else:
+            custom_stats_df.to_csv(custom_stats_path,mode='a',index=False,header=False)
+def main():
+    # make output directory for this run
+    os.makedirs('results',exist_ok=True)
+    output_dir = f'results/{get_local_time()}'
+    os.makedirs(output_dir,exist_ok=True)
+    with open_logfile(f'{output_dir}/puncta_benchmark_log.txt'):
+        # print configurations
+        print_configpy(config)
+        # Set CUDA_VISIBLE_DEVICES from the config, then log it for verification
+        os.environ['CUDA_VISIBLE_DEVICES'] = config.CUDA_VISIBLE_DEVICES
+        log_update("\nChecking on environment variables...")
+        log_update(f"\tCUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES')}")
+        # make embeddings if needed
+        all_embedding_paths = embed_dataset_for_benchmark(
+                                            fuson_ckpts=config.FUSONPLM_CKPTS,
+                                            input_data_path='splits.csv', input_fname='FOdb_puncta_sequences',
+                                            average=True, seq_col='aa_seq',
+                                            benchmark_fusonplm=config.BENCHMARK_FUSONPLM,
+                                            benchmark_esm=config.BENCHMARK_ESM,
+                                            benchmark_fo_puncta_ml=config.BENCHMARK_FO_PUNCTA_ML,
+                                            benchmark_prott5 = config.BENCHMARK_PROTT5,
+                                            overwrite=config.PERMISSION_TO_OVERWRITE)
+        # load the splits with labels
+        splits = pd.read_csv('splits.csv')
+        # perform some sanity checks on the splits
+        check_splits(splits)
+        n_train = len(splits.loc[splits['split']=='train'])
+        n_test = len(splits.loc[splits['split']=='test'])
+        log_update(f"\nSplit breakdown...\n\t{n_train} Training FOs\n\t{n_test} Verification FOs")
+        # set training constants
+        train_params = CustomParams(
+            N_ESTIMATORS = 50,
+            TREE_METHOD = "hist",
+            CLASS1_THRESHOLDS = {
+                'nucleus': 0.83,
+                'cytoplasm': 0.83,
+                'formation': 0.83
+            },
+        )
+        log_update("\nTraining configs:")
+        train_params.print_config(indent='\t')
+        log_update("\nTraining models")
+        # loop through the embedding paths and train each one
+        for embedding_path, details in all_embedding_paths.items():
+            log_update(f"\tBenchmarking embeddings at: {embedding_path}")
+            try:
+                with open(embedding_path, "rb") as f:
+                    embeddings = pickle.load(f)
+            except Exception as e:
+                # chain the original error so the underlying cause stays in the traceback
+                raise Exception(f"Cannot read embeddings from {embedding_path}") from e
+            # combine the embeddings and splits into one dataframe
+            splits_with_embeddings = pd.DataFrame(list(embeddings.items()), columns=['aa_seq', 'embedding'])
+            splits_with_embeddings = pd.merge(splits_with_embeddings, splits, on='aa_seq',how='left')
+            for task in ['nucleus','cytoplasm','formation']:
+                log_update(f"\t\tTask: {task}")
+                train_and_evaluate_puncta_predictor(details, splits_with_embeddings, output_dir, task=task,
+                                                    class1_thresh=train_params.CLASS1_THRESHOLDS[task],
+                                                    n_estimators=train_params.N_ESTIMATORS,tree_method=train_params.TREE_METHOD)
+        log_update("\nMaking summary figures:\n")
+        log_update("\tbar charts...")
+        os.makedirs(f"{output_dir}/figures",exist_ok=True)
+        make_all_final_bar_charts(output_dir)
+        log_update("\tDone.")
+if __name__ == '__main__':
+    main()
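Note on the result-saving branches in `train_and_evaluate_puncta_predictor`: both the automatic and the custom-threshold paths repeat the same "create the CSV with a header, otherwise append without one" logic. A minimal sketch of how that pattern could be factored into one helper is below. The helper name `append_results_csv`, the model names, and the `Accuracy` values are hypothetical placeholders for illustration and are not part of this commit.

```python
import os
import tempfile

import pandas as pd


def append_results_csv(stats_df: pd.DataFrame, path: str) -> None:
    """Hypothetical helper: create the CSV with a header on first write, append rows afterwards."""
    file_exists = os.path.exists(path)
    stats_df.to_csv(
        path,
        mode='a' if file_exists else 'w',
        index=False,
        header=not file_exists,  # write the header only once
    )


# Toy demonstration with placeholder rows (not real benchmark results)
with tempfile.TemporaryDirectory() as outdir:
    path = os.path.join(outdir, 'nucleus_verificationFOs_results.csv')
    for name in ['model_a', 'model_b']:
        row = pd.DataFrame({'Model Name': [name], 'Accuracy': [0.5]})
        append_results_csv(row, path)
    combined = pd.read_csv(path)
```

With a helper like this, each branch would reduce to building the output path and making one call. The append behaviour also relies on every benchmarked model producing the same columns in the same order, since rows after the first are written without a header.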