svincoff committed on Jan 18

Commit

6efd653

1 Parent(s): 0e3c3b0

data cleaning, blast, and splitting code with source data, also deleting unnecessary files


Files changed (23)

fuson_plm/data/README.md +7 -2
fuson_plm/data/blast/README.md +2 -1
model/model.pth → fuson_plm/data/blast/figures/identities_hist_source_data.csv +2 -2
fuson_plm/data/blast/plot.py +7 -1
fuson_plm/data/split_vis.py +14 -307
fuson_plm/data/splits/combined_plot.png +0 -0
fuson_plm/data/splits/split_vis/aa_comp.png +0 -0
fuson_plm/data/splits/split_vis/aa_comp_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/combined_plot.png +0 -0
fuson_plm/data/splits/split_vis/length_distributions.png +0 -0
fuson_plm/data/splits/split_vis/scatterplot.png +0 -0
fuson_plm/data/splits/split_vis/scatterplot_benchmark_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/scatterplot_test_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/scatterplot_train_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/scatterplot_val_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/shannon_entropy_plot.png +0 -0
fuson_plm/data/splits/split_vis/shannon_entropy_plot_test_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/shannon_entropy_plot_train_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/shannon_entropy_plot_val_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/test_lengths_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/train_lengths_source_data.csv +3 -0
fuson_plm/data/splits/split_vis/val_lengths_source_data.csv +3 -0
fuson_plm/utils/visualizing.py +87 -37

fuson_plm/data/README.md CHANGED

@@ -41,7 +41,7 @@ data/
 - **`cluster.py`**: script for clustering the processed data in fuson_db.csv. Print statements in this code produce `clustering_log.txt`.
 - **`config.py`**: configs for the cleaning, clustering, and splitting scripts.
 - **`split.py`**: script for splitting the data, post-clustering. Print statements in this code produce `splitting_log.txt`.
-- **`split_vis.py`** script with code for the plots in `splits/combined_plot.png`, which describe the content of the train, validation, and test splits (length distribution, Shannon Entropy, amino acid frequencies, and cluster sizes)
 #### Usage
 To repeat our cleaning, clustering, and splitting process, proceed as follows.
@@ -85,7 +85,12 @@ python split.py
 This script will create the following files:
 - **`splits/train_cluster_split.csv`, `splits/val_cluster_split.csv`, `splits/test_cluster_split.csv`**: The subsets of `clustering/mmseqs_full_results.csv` that have been partitioned into the train, validation, and test sets respectively.
 - **`splits/train_df.csv`, `splits/val_df.csv`, `splits/test_df.csv`**: The train, validation, and testing splits used to train FusOn-pLM. Columns: `sequence`,`member length`
-- **`splits/combined_plot.png`**: plot displaying the composition of the train, validation, and test splits.
 ### BLAST
 We ran BLAST to get the best alignment of each sequence in FusOn-DB to a protein in SwissProt. See the README in the `blast` folder for more details.

 - **`cluster.py`**: script for clustering the processed data in fuson_db.csv. Print statements in this code produce `clustering_log.txt`.
 - **`config.py`**: configs for the cleaning, clustering, and splitting scripts.
 - **`split.py`**: script for splitting the data, post-clustering. Print statements in this code produce `splitting_log.txt`.
+- **`split_vis.py`**: script with code for the plots in `splits/split_vis/combined_plot.png`, which describe the content of the train, validation, and test splits (length distribution, Shannon entropy, amino acid frequencies, and cluster sizes). Note that many of the methods are defined in `fuson_plm/utils/visualizing.py`.
 #### Usage
 To repeat our cleaning, clustering, and splitting process, proceed as follows.
 This script will create the following files:
 - **`splits/train_cluster_split.csv`, `splits/val_cluster_split.csv`, `splits/test_cluster_split.csv`**: The subsets of `clustering/mmseqs_full_results.csv` that have been partitioned into the train, validation, and test sets respectively.
 - **`splits/train_df.csv`, `splits/val_df.csv`, `splits/test_df.csv`**: The train, validation, and testing splits used to train FusOn-pLM. Columns: `sequence`,`member length`
+- the **`split_vis`** folder, which contains all visualizations in Fig. S4 and the data that was directly plotted in these visualizations (`*_source_data.csv` files). Note that the individual subplots have slightly different dimensions than they do in the combined Fig. S4.
+    - **`splits/split_vis/combined_plot.png`**: plot displaying the composition of the train, validation, and test splits (Fig. S4).
+    - **`splits/split_vis/length_distributions.png`**: plot displaying the length distributions of the train, validation, and test splits (Fig. S4A).
+    - **`splits/split_vis/shannon_entropy_plot.png`**: plot displaying the Shannon entropy distributions of the train, validation, and test splits (Fig. S4B).
+    - **`splits/split_vis/scatterplot.png`**: plot displaying the cluster size distributions of the train, validation, and test splits (Fig. S4C).
+    - **`splits/split_vis/aa_comp.png`**: plot displaying the amino acid composition of the train, validation, and test splits (Fig. S4D).
 ### BLAST
 We ran BLAST to get the best alignment of each sequence in FusOn-DB to a protein in SwissProt. See the README in the `blast` folder for more details.
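Reviewer note: the split statistics this README describes (per-sequence Shannon entropy and pooled amino acid composition) can be sketched in a few lines. This is a minimal standard-library sketch, not the repository's implementation: the repository's `calculate_shannon_entropy` calls `entropy(counts, base=2)`, whereas the function names `shannon_entropy` and `aa_composition` below are hypothetical and compute the same quantities directly.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy, in bits, of the residue distribution of one sequence."""
    if not sequence:
        return 0.0
    total = len(sequence)
    # H = sum over residues of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in Counter(sequence).values())


def aa_composition(sequences: list[str]) -> dict[str, float]:
    """Relative frequency of each residue, pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}
```

A sequence made of a single repeated residue has entropy 0, and a sequence that uses four residues equally often has entropy 2 bits, which is a quick sanity check when reading the `shannon_entropy_plot_*_source_data.csv` files.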

fuson_plm/data/blast/README.md CHANGED

@@ -30,6 +30,7 @@ data/
         ├── best_htg_alignments_swissprot_seqs.pkl
         ├── ht_uniprot_query.txt
     └── figures/
         ├── identities_hist.png
     ├── blast_fusions.py
     ├── extract_blast_seqs.py
@@ -40,7 +41,7 @@ data/
 - **`blast_fusions.py`**: script that will prepare FusOn-DB for BLAST, run BLAST against SwissProt (given you've installed BLAST software properly), extract top alignments and calculate statistics on the BLAST results, and make results plots. Print statements in this script create the log file `fusion_blast_log.txt`.
 - **`extract_blast_seqs.py`**: script that will extract sequences of all the head/tail proteins that formed the best alignment during BLAST, directly from the SwissProt BLAST database. Creates the file `blast_outputs/best_htg_alignments_swissprot_seqs.pkl`.
-- **`plot.py`**: script to make the plot found at `figures/identities_hist.png`. This plot displays the maximum % identity of each fusion oncoprotein sequence with a SwissProt sequence, based on BLAST. This plot is also automatically created by `blast_fusions.py`.
 - **`fuson_ht_db.csv`**: Database that merges FusOn-DB (`/*/FusOn-pLM/fuson_plm/data/fuson_db.csv`) with `/*/FusOn-pLM/fuson_plm/data/head_tail_data/htgenes_uniprotids.csv`, which simplifies the process of analyzing BLAST results. In FusOn-DB, certain amino acid sequences are associated with multiple fusion oncoproteins, whose names are comma-separated in the `fusiongenes` column. In `fuson_ht_db.csv`, the `fusiongenes` column is exploded such that each row only has one fusion gene. Therefore, this database has more rows than FusOn-DB, and some duplicate sequences.
 To run BLAST search and analysis, we recommend using nohup as the process will take a long time.

         ├── best_htg_alignments_swissprot_seqs.pkl
         ├── ht_uniprot_query.txt
     └── figures/
+        ├── identities_hist_source_data.csv
         ├── identities_hist.png
     ├── blast_fusions.py
     ├── extract_blast_seqs.py
 - **`blast_fusions.py`**: script that will prepare FusOn-DB for BLAST, run BLAST against SwissProt (given you've installed BLAST software properly), extract top alignments and calculate statistics on the BLAST results, and make results plots. Print statements in this script create the log file `fusion_blast_log.txt`.
 - **`extract_blast_seqs.py`**: script that will extract sequences of all the head/tail proteins that formed the best alignment during BLAST, directly from the SwissProt BLAST database. Creates the file `blast_outputs/best_htg_alignments_swissprot_seqs.pkl`.
+- **`plot.py`**: script to make the plot found at `figures/identities_hist.png` (Fig. 1B histogram). The exact data plotted in this histogram is in `figures/identities_hist_source_data.csv`. This plot displays the maximum % identity of each fusion oncoprotein sequence with a SwissProt sequence, based on BLAST. This plot is also automatically created by `blast_fusions.py`.
 - **`fuson_ht_db.csv`**: Database that merges FusOn-DB (`/*/FusOn-pLM/fuson_plm/data/fuson_db.csv`) with `/*/FusOn-pLM/fuson_plm/data/head_tail_data/htgenes_uniprotids.csv`, which simplifies the process of analyzing BLAST results. In FusOn-DB, certain amino acid sequences are associated with multiple fusion oncoproteins, whose names are comma-separated in the `fusiongenes` column. In `fuson_ht_db.csv`, the `fusiongenes` column is exploded such that each row only has one fusion gene. Therefore, this database has more rows than FusOn-DB, and some duplicate sequences.
 To run BLAST search and analysis, we recommend using nohup as the process will take a long time.

model/model.pth → fuson_plm/data/blast/figures/identities_hist_source_data.csv RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:30c595b39e6f75c4d0d7d8d46eb6252931b0fe5841707396027521577ebf9798
-size 2609657850

 version https://git-lfs.github.com/spec/v1
+oid sha256:5f2219c7ab63205bcc8d3f21a7d1718aa6054179506411bb641e073b739ca2c7
+size 691452
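
Reviewer note: the three-line hunks shown for the `*_source_data.csv` files are Git LFS pointer files, not the CSV contents themselves. Each pointer records a spec version, a SHA-256 object id, and the object size in bytes. The sketch below reads those fields from pointer text; the function name `parse_lfs_pointer` is hypothetical, and the sample text is the new pointer from the hunk above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The size field is a byte count, so convert it to an integer.
    fields["size"] = int(fields["size"])
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:5f2219c7ab63205bcc8d3f21a7d1718aa6054179506411bb641e073b739ca2c7
size 691452
"""
info = parse_lfs_pointer(pointer)
```

Read this way, the hunk replaces a pointer to a 2,609,657,850-byte object with a pointer to a 691,452-byte object.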

fuson_plm/data/blast/plot.py CHANGED

@@ -19,10 +19,16 @@ def plot_pos_or_id_pcnt_hist(data, column_name, save_path=None, ax=None):
         fig, ax = plt.subplots(figsize=(10, 7))
     # Make the sample data
-    data = data[['aa_seq_len', column_name]].dropna()  # only keep those with alignments
     data[column_name] = data[column_name]*100 # so it's %
     data[f"{column_name} Percent Coverage"] = data[column_name] / data['aa_seq_len']
     # Calculate the mean and median of the percent coverage
     mean_coverage = data[f"{column_name} Percent Coverage"].mean()
     median_coverage = data[f"{column_name} Percent Coverage"].median()

         fig, ax = plt.subplots(figsize=(10, 7))
     # Make the sample data
+    data = data[['seq_id','aa_seq_len', column_name]].dropna()  # only keep those with alignments
     data[column_name] = data[column_name]*100 # so it's %
     data[f"{column_name} Percent Coverage"] = data[column_name] / data['aa_seq_len']
+    # Save this sample data (only when a save path is given; save_path defaults to None)
+    if save_path is not None:
+        source_data_save_path = save_path.replace(".png","_source_data.csv")
+        source_data = data[['seq_id',f"{column_name} Percent Coverage"]].sort_values(by=f"{column_name} Percent Coverage",ascending=True)
+        source_data[f"{column_name} Percent Coverage"] = source_data[f"{column_name} Percent Coverage"].round(3)
+        source_data.to_csv(source_data_save_path,index=False)
     # Calculate the mean and median of the percent coverage
     mean_coverage = data[f"{column_name} Percent Coverage"].mean()
     median_coverage = data[f"{column_name} Percent Coverage"].median()
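
Reviewer note: this hunk derives the source-data filename with `save_path.replace(".png","_source_data.csv")`, while the signature in the hunk header shows `save_path=None` as the default. The sketch below shows one possible way to derive the path that tolerates `None` and only touches the final file name. The helper name `source_data_path` is hypothetical and is not part of the repository.

```python
from pathlib import Path
from typing import Optional


def source_data_path(save_path: Optional[str]) -> Optional[Path]:
    """Return the sibling `<stem>_source_data.csv` path, or None if no path was given."""
    if save_path is None:
        return None
    p = Path(save_path)
    # with_name swaps only the last path component, so a ".png" elsewhere
    # in the path (for example in a directory name) is left untouched.
    return p.with_name(f"{p.stem}_source_data.csv")
```

With `figures/identities_hist.png` as input this yields `figures/identities_hist_source_data.csv`, matching the file added in this commit.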

fuson_plm/data/split_vis.py CHANGED

@@ -6,312 +6,8 @@ import pickle
 import pandas as pd
 import os
 from fuson_plm.utils.logging import log_update
-from fuson_plm.utils.visualizing import set_font
-def calculate_aa_composition(sequences):
-    composition = {}
-    total_length = sum([len(seq) for seq in sequences])
-    for seq in sequences:
-        for aa in seq:
-            if aa in composition:
-                composition[aa] += 1
-            else:
-                composition[aa] = 1
-    # Convert counts to relative frequency
-    for aa in composition:
-        composition[aa] /= total_length
-    return composition
-def calculate_shannon_entropy(sequence):
-    """
-    Calculate the Shannon entropy for a given sequence.
-    Args:
-        sequence (str): A sequence of characters (e.g., amino acids or nucleotides).
-    Returns:
-        float: Shannon entropy value.
-    """
-    bases = set(sequence)
-    counts = [sequence.count(base) for base in bases]
-    return entropy(counts, base=2)
-def visualize_splits_hist(train_lengths, val_lengths, test_lengths, colormap, savepath=f'../data/splits/length_distributions.png', axes=None):
-    log_update('\nMaking histogram of length distributions')
-    # Create a figure and axes with 1 row and 3 columns
-    if axes is None:
-        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
-    # Unpack the labels and titles
-    xlabel, ylabel = ['Sequence Length (AA)', 'Frequency']
-    # Plot the first histogram
-    axes[0].hist(train_lengths, bins=20, edgecolor='k',color=colormap['train'])
-    axes[0].set_xlabel(xlabel, fontsize=24)
-    axes[0].set_ylabel(ylabel, fontsize=24)
-    axes[0].set_title(f'Train Set Length Distribution (n={len(train_lengths)})', fontsize=24)
-    axes[0].grid(True)
-    axes[0].set_axisbelow(True)
-    axes[0].tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
-    axes[0].tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
-    # Plot the second histogram
-    axes[1].hist(val_lengths, bins=20, edgecolor='k',color=colormap['val'])
-    axes[1].set_xlabel(xlabel, fontsize=24)
-    axes[1].set_ylabel(ylabel, fontsize=24)
-    axes[1].set_title(f'Validation Set Length Distribution (n={len(val_lengths)})', fontsize=24)
-    axes[1].grid(True)
-    axes[1].set_axisbelow(True)
-    axes[1].tick_params(axis='x', labelsize=24)
-    axes[1].tick_params(axis='y', labelsize=24)
-    # Plot the third histogram
-    axes[2].hist(test_lengths, bins=20, edgecolor='k',color=colormap['test'])
-    axes[2].set_xlabel(xlabel, fontsize=24)
-    axes[2].set_ylabel(ylabel, fontsize=24)
-    axes[2].set_title(f'Test Set Length Distribution (n={len(test_lengths)})', fontsize=24)
-    axes[2].grid(True)
-    axes[2].set_axisbelow(True)
-    axes[2].tick_params(axis='x', labelsize=24)
-    axes[2].tick_params(axis='y', labelsize=24)
-    # Adjust layout
-    if savepath is not None:
-        plt.tight_layout()
-        # Save the figure
-        plt.savefig(savepath)
-def visualize_splits_scatter(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps, colormap, savepath='../data/splits/scatterplot.png', ax=None):
-    log_update("\nMaking scatterplot with distribution of cluster sizes across train, test, and val")
-    # Make grouped versions of these DataFrames for size analysis
-    train_clustersgb = train_clusters.groupby('representative seq_id')['member seq_id'].count().reset_index().rename(columns={'member seq_id':'member count'})
-    val_clustersgb = val_clusters.groupby('representative seq_id')['member seq_id'].count().reset_index().rename(columns={'member seq_id':'member count'})
-    test_clustersgb = test_clusters.groupby('representative seq_id')['member seq_id'].count().reset_index().rename(columns={'member seq_id':'member count'})
-    # Isolate benchmark-containing clusters so their contribution can be plotted separately
-    total_test_proteins = sum(test_clustersgb['member count'])
-    test_clustersgb['benchmark cluster'] = test_clustersgb['representative seq_id'].isin(benchmark_cluster_reps)
-    benchmark_clustersgb = test_clustersgb.loc[test_clustersgb['benchmark cluster']].reset_index(drop=True)
-    test_clustersgb = test_clustersgb.loc[test_clustersgb['benchmark cluster']==False].reset_index(drop=True)
-    # Convert them to value counts
-    train_clustersgb = train_clustersgb['member count'].value_counts().reset_index().rename(columns={'index':'cluster size (n_members)','member count': 'n_clusters'})
-    val_clustersgb = val_clustersgb['member count'].value_counts().reset_index().rename(columns={'index':'cluster size (n_members)','member count': 'n_clusters'})
-    test_clustersgb = test_clustersgb['member count'].value_counts().reset_index().rename(columns={'index':'cluster size (n_members)','member count': 'n_clusters'})
-    benchmark_clustersgb = benchmark_clustersgb['member count'].value_counts().reset_index().rename(columns={'index':'cluster size (n_members)','member count': 'n_clusters'})
-    # Get the percentage of each dataset that's made of each cluster size
-    train_clustersgb['n_proteins'] = train_clustersgb['cluster size (n_members)']*train_clustersgb['n_clusters']    # proteins per cluster * n clusters = # proteins
-    train_clustersgb['percent_proteins'] = train_clustersgb['n_proteins']/sum(train_clustersgb['n_proteins'])
-    val_clustersgb['n_proteins'] = val_clustersgb['cluster size (n_members)']*val_clustersgb['n_clusters']
-    val_clustersgb['percent_proteins'] = val_clustersgb['n_proteins']/sum(val_clustersgb['n_proteins'])
-    test_clustersgb['n_proteins'] = test_clustersgb['cluster size (n_members)']*test_clustersgb['n_clusters']
-    test_clustersgb['percent_proteins'] = test_clustersgb['n_proteins']/total_test_proteins
-    benchmark_clustersgb['n_proteins'] = benchmark_clustersgb['cluster size (n_members)']*benchmark_clustersgb['n_clusters']
-    benchmark_clustersgb['percent_proteins'] = benchmark_clustersgb['n_proteins']/total_test_proteins
-    # Specially mark the benchmark clusters because these can't be reallocated
-    if ax is None:
-        fig, ax = plt.subplots(figsize=(18, 6))
-    ax.plot(train_clustersgb['cluster size (n_members)'],train_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['train'],label='train')
-    ax.plot(val_clustersgb['cluster size (n_members)'],val_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['val'],label='val')
-    ax.plot(test_clustersgb['cluster size (n_members)'],test_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['test'],label='test')
-    ax.plot(benchmark_clustersgb['cluster size (n_members)'],benchmark_clustersgb['percent_proteins'],
-            marker='o',
-            linestyle='None',
-            markerfacecolor=colormap['test'],      # fill same as test
-            markeredgecolor='black',    # outline black
-            markeredgewidth=1.5,
-            label='benchmark'
-        )
-    ax.set_ylabel('Percentage of Proteins in Dataset', fontsize=24)
-    ax.set_xlabel('Cluster Size', fontsize=24)
-    ax.tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
-    ax.tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
-    ax.legend(fontsize=24,markerscale=4)
-    # save the figure
-    if savepath is not None:
-        plt.tight_layout()
-        plt.savefig(savepath)
-        log_update(f"\tSaved figure to {savepath}")
-def get_avg_embeddings_for_tsne(train_sequences, val_sequences, test_sequences, embedding_path='fuson_db_embeddings/fuson_db_esm2_t33_650M_UR50D_avg_embeddings.pkl'):
-    embeddings = {}
-    try:
-        with open(embedding_path, 'rb') as f:
-            embeddings = pickle.load(f)
-        train_embeddings = [v for k, v in embeddings.items() if k in train_sequences]
-        val_embeddings = [v for k, v in embeddings.items() if k in val_sequences]
-        test_embeddings = [v for k, v in embeddings.items() if k in test_sequences]
-        return train_embeddings, val_embeddings, test_embeddings
-    except:
-        print("could not open embeddings")
-def visualize_splits_tsne(train_sequences, val_sequences, test_sequences, colormap, esm_type="esm2_t33_650M_UR50D", embedding_path="fuson_db_embeddings/fuson_db_esm2_t33_650M_UR50D_avg_embeddings.pkl", savepath='../data/splits/tsne_plot.png',ax=None):
-    """
-    Generate a t-SNE plot of embeddings for train, test, and validation.
-    """
-    log_update('\nMaking t-SNE plot of train, val, and test embeddings')
-    # Combine the embeddings into one array
-    train_embeddings, val_embeddings, test_embeddings = get_avg_embeddings_for_tsne(train_sequences, val_sequences, test_sequences, embedding_path=embedding_path)
-    embeddings = np.concatenate([train_embeddings, val_embeddings, test_embeddings])
-    # Labels for the embeddings
-    labels = ['train'] * len(train_embeddings) + ['val'] * len(val_embeddings) + ['test'] * len(test_embeddings)
-    # Perform t-SNE
-    tsne = TSNE(n_components=2, random_state=42)
-    tsne_results = tsne.fit_transform(embeddings)
-    # Convert t-SNE results into a DataFrame
-    tsne_df = pd.DataFrame(data=tsne_results, columns=['TSNE_1', 'TSNE_2'])
-    tsne_df['label'] = labels
-    # Plotting
-    if ax is None:
-        fig, ax = plt.subplots(figsize=(10, 8))
-    # Scatter plot for each set
-    for label, color in colormap.items():
-        subset = tsne_df[tsne_df['label'] == label].reset_index(drop=True)
-        ax.scatter(subset['TSNE_1'], subset['TSNE_2'], c=color, label=label.capitalize(), alpha=0.6)
-    ax.set_title(f't-SNE of {esm_type} Embeddings')
-    ax.set_xlabel('t-SNE Dimension 1')
-    ax.set_ylabel('t-SNE Dimension 2')
-    ax.legend(fontsize=24, markerscale=2)
-    ax.grid(True)
-    # Save the figure if savepath is provided
-    if savepath:
-        plt.tight_layout()
-        fig.savefig(savepath)
-def visualize_splits_shannon_entropy(train_sequences, val_sequences, test_sequences, colormap, savepath='../data/splits/shannon_entropy_plot.png',axes=None):
-    """
-    Generate Shannon entropy plots for train, validation, and test sets.
-    """
-    log_update('\nMaking histogram of Shannon Entropy distributions')
-    train_entropy = [calculate_shannon_entropy(seq) for seq in train_sequences]
-    val_entropy = [calculate_shannon_entropy(seq) for seq in val_sequences]
-    test_entropy = [calculate_shannon_entropy(seq) for seq in test_sequences]
-    if axes is None:
-        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
-    axes[0].hist(train_entropy, bins=20, edgecolor='k', color=colormap['train'])
-    axes[0].set_title(f'Train Set (n={len(train_entropy)})', fontsize=24)
-    axes[0].set_xlabel('Shannon Entropy', fontsize=24)
-    axes[0].set_ylabel('Frequency', fontsize=24)
-    axes[0].grid(True)
-    axes[0].set_axisbelow(True)
-    axes[0].tick_params(axis='x', labelsize=24)
-    axes[0].tick_params(axis='y', labelsize=24)
-    axes[1].hist(val_entropy, bins=20, edgecolor='k', color=colormap['val'])
-    axes[1].set_title(f'Validation Set (n={len(val_entropy)})', fontsize=24)
-    axes[1].set_xlabel('Shannon Entropy', fontsize=24)
-    axes[1].grid(True)
-    axes[1].set_axisbelow(True)
-    axes[1].tick_params(axis='x', labelsize=24)
-    axes[1].tick_params(axis='y', labelsize=24)
-    axes[2].hist(test_entropy, bins=20, edgecolor='k', color=colormap['test'])
-    axes[2].set_title(f'Test Set (n={len(test_entropy)})', fontsize=24)
-    axes[2].set_xlabel('Shannon Entropy', fontsize=24)
-    axes[2].grid(True)
-    axes[2].set_axisbelow(True)
-    axes[2].tick_params(axis='x', labelsize=24)
-    axes[2].tick_params(axis='y', labelsize=24)
-    if savepath is not None:
-        plt.tight_layout()
-        plt.savefig(savepath)
-def visualize_splits_aa_composition(train_sequences, val_sequences, test_sequences,colormap, savepath='../data/splits/aa_comp.png',ax=None):
-    log_update('\nMaking bar plot of AA composition across each set')
-    train_comp = calculate_aa_composition(train_sequences)
-    val_comp = calculate_aa_composition(val_sequences)
-    test_comp = calculate_aa_composition(test_sequences)
-    # Create DataFrame
-    comp_df = pd.DataFrame([train_comp, val_comp, test_comp], index=['train', 'val', 'test']).T
-    colors = [colormap[col] for col in comp_df.columns]
-    # Plotting
-    #fig, ax = plt.subplots(figsize=(12, 6))
-    if ax is None:
-        fig, ax = plt.subplots(figsize=(12, 6))
-    else:
-        fig = ax.get_figure()
-    comp_df.plot(kind='bar', color=colors, ax=ax)
-    ax.set_title('Amino Acid Composition Across Datasets', fontsize=24)
-    ax.set_xlabel('Amino Acid', fontsize=24)
-    ax.set_ylabel('Relative Frequency', fontsize=24)
-    ax.tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
-    ax.tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
-    ax.legend(fontsize=16, markerscale=2)
-    if savepath is not None:
-        fig.savefig(savepath)
-def visualize_splits(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps, train_color='#0072B2',val_color='#009E73',test_color='#E69F00',esm_embeddings_path=None, onehot_embeddings_path=None):
-    colormap = {
-        'train': train_color,
-        'val': val_color,
-        'test': test_color
-    }
-    # Add columns for plotting
-    train_clusters['member length'] = train_clusters['member seq'].str.len()
-    val_clusters['member length'] = val_clusters['member seq'].str.len()
-    test_clusters['member length'] = test_clusters['member seq'].str.len()
-    # Prepare lengths and seqs for plotting
-    train_lengths = train_clusters['member length'].tolist()
-    val_lengths = val_clusters['member length'].tolist()
-    test_lengths = test_clusters['member length'].tolist()
-    train_sequences = train_clusters['member seq'].tolist()
-    val_sequences = val_clusters['member seq'].tolist()
-    test_sequences = test_clusters['member seq'].tolist()
-    # Create a combined figure with 3 rows and 3 columns
-    fig_combined, axs = plt.subplots(3, 3, figsize=(24, 18))
-    # Make the three visualization plots for saving TOGETHER
-    visualize_splits_hist(train_lengths,val_lengths,test_lengths,colormap, savepath=None,axes=axs[0])
-    visualize_splits_shannon_entropy(train_sequences,val_sequences,test_sequences,colormap,savepath=None,axes=axs[1])
-    visualize_splits_scatter(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps, colormap, savepath=None, ax=axs[2, 0])
-    visualize_splits_aa_composition(train_sequences,val_sequences,test_sequences, colormap, savepath=None, ax=axs[2, 1])
-    if not(esm_embeddings_path is None) and os.path.exists(esm_embeddings_path):
-        visualize_splits_tsne(train_sequences, val_sequences, test_sequences, colormap, savepath=None, ax=axs[2, 2])
-    else:
-    # Leave the last subplot blank
-        axs[2, 2].axis('off')
-    plt.tight_layout()
-    fig_combined.savefig('../data/splits/combined_plot.png')
-    # Make the three visualization plots for saving separately
-    visualize_splits_hist(train_clusters['member length'].tolist(), val_clusters['member length'].tolist(), test_clusters['member length'].tolist(),colormap)
-    visualize_splits_scatter(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps, colormap)
-    visualize_splits_aa_composition(train_clusters['member seq'].tolist(), val_clusters['member seq'].tolist(), test_clusters['member seq'].tolist(),colormap)
-    visualize_splits_shannon_entropy(train_sequences,val_sequences,test_sequences,colormap)
-    if not(esm_embeddings_path is None) and os.path.exists(esm_embeddings_path):
-        visualize_splits_tsne(train_clusters['member seq'].tolist(), val_clusters['member seq'].tolist(), test_clusters['member seq'].tolist(),colormap)
 def main():
     set_font()
     train_clusters = pd.read_csv('splits/train_cluster_split.csv')
@@ -326,8 +22,19 @@ def main():
     # Use benchmark_seq_ids to find which clusters contain benchmark sequences.
     benchmark_cluster_reps = clusters.loc[clusters['member seq_id'].isin(benchmark_seq_ids)]['representative seq_id'].unique().tolist()
-    visualize_splits(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps,
-                         esm_embeddings_path='fuson_db_embeddings/fuson_db_esm2_t33_650M_UR50D_avg_embeddings.pkl', onehot_embeddings_path=None)
 if __name__ == "__main__":
     main()

 import pandas as pd
 import os
 from fuson_plm.utils.logging import log_update
+from fuson_plm.utils.visualizing import set_font, visualize_splits
 def main():
     set_font()
     train_clusters = pd.read_csv('splits/train_cluster_split.csv')
     # Use benchmark_seq_ids to find which clusters contain benchmark sequences.
     benchmark_cluster_reps = clusters.loc[clusters['member seq_id'].isin(benchmark_seq_ids)]['representative seq_id'].unique().tolist()
+    visualize_splits(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps)
+    ## Add seq_id to every source data file that is saved from visualize_splits
+    seq_to_id_dict = dict(zip(fuson_db['aa_seq'],fuson_db['seq_id']))
+    files_to_edit = os.listdir("splits/split_vis")
+    files_to_edit = [x for x in files_to_edit if x.endswith(".csv")]
+    log_update(f"Adding seq_ids to the following files: {files_to_edit}")
+    for fname in files_to_edit:
+        source_data_file = pd.read_csv(f"splits/split_vis/{fname}")
+        if "sequence" in list(source_data_file.columns):
+            source_data_file["seq_id"] = source_data_file["sequence"].map(seq_to_id_dict)
+            source_data_file.drop(columns=['sequence']).to_csv(f"splits/split_vis/{fname}",index=False)
 if __name__ == "__main__":
     main()
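
Reviewer note: the new post-processing step in `main()` replaces the `sequence` column of each source-data CSV with a `seq_id` column, looked up through a sequence-to-id dictionary. The sketch below restates that step on plain dictionaries using only the standard library, so the behavior for unmatched sequences is explicit. The function name `swap_sequence_for_seq_id` and the sample ids are hypothetical; the commit itself uses pandas `map` and `drop`.

```python
def swap_sequence_for_seq_id(rows, seq_to_id):
    """Return copies of rows with 'sequence' replaced by a 'seq_id' lookup."""
    out = []
    for row in rows:
        if "sequence" not in row:
            # Rows without a sequence column are passed through unchanged.
            out.append(dict(row))
            continue
        new_row = {k: v for k, v in row.items() if k != "sequence"}
        # Sequences missing from the lookup get None rather than raising.
        new_row["seq_id"] = seq_to_id.get(row["sequence"])
        out.append(new_row)
    return out


seq_to_id = {"MKV": "seq1", "MAA": "seq2"}  # hypothetical ids
rows = [
    {"sequence": "MKV", "length": 3},
    {"sequence": "QQQ", "length": 3},  # not present in the lookup
]
result = swap_sequence_for_seq_id(rows, seq_to_id)
```

In the pandas version, a sequence that is absent from the dictionary maps to `NaN`, so checking the written files for missing `seq_id` values is a cheap way to confirm that every plotted sequence was found in FusOn-DB.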

fuson_plm/data/splits/combined_plot.png DELETED

Binary file (267 kB)

fuson_plm/data/splits/split_vis/aa_comp.png ADDED

fuson_plm/data/splits/split_vis/aa_comp_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c2f71076771c047787076795f088ec99c532fe13a901890c29355431d5cbf428
+size 1273

fuson_plm/data/splits/split_vis/combined_plot.png ADDED

fuson_plm/data/splits/split_vis/length_distributions.png ADDED

fuson_plm/data/splits/split_vis/scatterplot.png ADDED

fuson_plm/data/splits/split_vis/scatterplot_benchmark_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:45b3b580146c214c76fec277ac721b2de0f1f9a5f0c8096dad13f39340d15da1
+size 755

fuson_plm/data/splits/split_vis/scatterplot_test_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1e888045ff3c6824e98ddb3c25f45cdae80d06bfddf7878c1309ef7d47da9ce8
+size 794

fuson_plm/data/splits/split_vis/scatterplot_train_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8047af5c61b9238654dbd4d3f693fad68e09ece90fb72fb5879530bd17969d3d
+size 1396

fuson_plm/data/splits/split_vis/scatterplot_val_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5802d962e638004536d5473ccb7d5f3150a481d1295e1abf9fed55fe28311aea
+size 819

fuson_plm/data/splits/split_vis/shannon_entropy_plot.png ADDED

fuson_plm/data/splits/split_vis/shannon_entropy_plot_test_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:51bb7926f16e082974f2693b67b77d7d7246e63766cb76ecf9f0638fe9656670
+size 112646

fuson_plm/data/splits/split_vis/shannon_entropy_plot_train_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4dc397807e7f6114d8cdd571efb779ac9d98f76f1e5f0c9a872fe1b355453d54
+size 902131

fuson_plm/data/splits/split_vis/shannon_entropy_plot_val_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d25a06e2ccf701d7299922a2978ea478dd0fd78339e852060628a71bbbe024a9
+size 112779

fuson_plm/data/splits/split_vis/test_lengths_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2ac21f806a235f769afe107ed8bc89476d3f9af66c30ff2b1c6ff55da9a8a1f1
+size 54367

fuson_plm/data/splits/split_vis/train_lengths_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:146c64a03c446cc7c5a59b4d72edd099b526e1f4c9e777a1cbed7f6dd410a3b6
+size 435911

fuson_plm/data/splits/split_vis/val_lengths_source_data.csv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5fd936ad5daed767e3815dac6315d92abcabb4fe506c169e6a958031a4fd2d97
+size 54478

fuson_plm/utils/visualizing.py CHANGED

@@ -34,11 +34,12 @@ def set_font():
     # Set the font family globally to Ubuntu
     plt.rcParams['font.family'] = regular_font.get_name()
-    # Set the fonts for math text (like for labels) to use the loaded Ubuntu fonts
     plt.rcParams['mathtext.fontset'] = 'custom'
     plt.rcParams['mathtext.rm'] = regular_font.get_name()
-    plt.rcParams['mathtext.it'] = f'{italic_font.get_name()}'
-    plt.rcParams['mathtext.bf'] = f'{bold_font.get_name()}'
 global default_color_map
 default_color_map = {
@@ -98,7 +99,7 @@ def calculate_shannon_entropy(sequence):
     counts = [sequence.count(base) for base in bases]
     return entropy(counts, base=2)
-def visualize_splits_hist(train_lengths=None, val_lengths=None, test_lengths=None, colormap=None, savepath=f'splits/length_distributions.png', axes=None):
     """
     Works to plot train, val, test; train, val; or train, test
     """
@@ -118,7 +119,7 @@ def visualize_splits_hist(train_lengths=None, val_lengths=None, test_lengths=Non
         total_plots-=1
     # Create a figure and axes with 1 row and 3 columns
-    fig_individual, axes_individual = plt.subplots(1, total_plots, figsize=(6*total_plots, 6))
     # Set axes list
     axes_list = [axes_individual] if axes is None else [axes_individual, axes]
@@ -129,29 +130,35 @@ def visualize_splits_hist(train_lengths=None, val_lengths=None, test_lengths=Non
     for cur_axes in axes_list:
         # Plot the first histogram
         cur_axes[0].hist(train_lengths, bins=20, edgecolor='k',color=colormap['train'])
-        cur_axes[0].set_xlabel(xlabel)
-        cur_axes[0].set_ylabel(ylabel)
-        cur_axes[0].set_title(f'Train Set Length Distribution (n={len(train_lengths)})')
         cur_axes[0].grid(True)
         cur_axes[0].set_axisbelow(True)
         # Plot the second histogram
         if not(val_plot_index is None):
             cur_axes[val_plot_index].hist(val_lengths, bins=20, edgecolor='k',color=colormap['val'])
-            cur_axes[val_plot_index].set_xlabel(xlabel)
-            cur_axes[val_plot_index].set_ylabel(ylabel)
-            cur_axes[val_plot_index].set_title(f'Validation Set Length Distribution (n={len(val_lengths)})')
             cur_axes[val_plot_index].grid(True)
             cur_axes[val_plot_index].set_axisbelow(True)
         # Plot the third histogram
         if not(test_plot_index is None):
             cur_axes[test_plot_index].hist(test_lengths, bins=20, edgecolor='k',color=colormap['test'])
-            cur_axes[test_plot_index].set_xlabel(xlabel)
-            cur_axes[test_plot_index].set_ylabel(ylabel)
-            cur_axes[test_plot_index].set_title(f'Test Set Length Distribution (n={len(test_lengths)})')
             cur_axes[test_plot_index].grid(True)
             cur_axes[test_plot_index].set_axisbelow(True)
     # Adjust layout
     fig_individual.set_tight_layout(True)
@@ -160,7 +167,7 @@ def visualize_splits_hist(train_lengths=None, val_lengths=None, test_lengths=Non
     fig_individual.savefig(savepath)
     log_update(f"\tSaved figure to {savepath}")
-def visualize_splits_scatter(train_clusters=None, val_clusters=None, test_clusters=None, benchmark_cluster_reps=None, colormap=None, savepath='splits/scatterplot.png', axes=None):
     set_font()
     if colormap is None: colormap=default_color_map
@@ -209,10 +216,13 @@ def visualize_splits_scatter(train_clusters=None, val_clusters=None, test_cluste
     # Specially mark the benchmark clusters because these can't be reallocated
     for ax in axes_list:
         ax.plot(train_clustersgb['cluster size (n_members)'],train_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['train'],label='train')
         if not(val_clusters is None):
             ax.plot(val_clustersgb['cluster size (n_members)'],val_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['val'],label='val')
         if not(test_clusters is None):
             ax.plot(test_clustersgb['cluster size (n_members)'],test_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['test'],label='test')
         if not(benchmark_cluster_reps is None):
             ax.plot(benchmark_clustersgb['cluster size (n_members)'],benchmark_clustersgb['percent_proteins'],
                 marker='o',
@@ -222,8 +232,13 @@ def visualize_splits_scatter(train_clusters=None, val_clusters=None, test_cluste
                 markeredgewidth=1.5,
                 label='benchmark'
             )
-        ax.set(ylabel='Percentage of Proteins in Dataset',xlabel='cluster_size')
-        ax.legend()
     # save the figure
     fig_individual.set_tight_layout(True)
@@ -231,7 +246,7 @@ def visualize_splits_scatter(train_clusters=None, val_clusters=None, test_cluste
     log_update(f"\tSaved figure to {savepath}")
-def visualize_splits_tsne(train_sequences=None, val_sequences=None, test_sequences=None, colormap=None, esm_type="esm2_t33_650M_UR50D", embedding_path="fuson_db_embeddings/fuson_db_esm2_t33_650M_UR50D_avg_embeddings.pkl", savepath='splits/tsne_plot.png',axes=None):
     set_font()
     if colormap is None: colormap=default_color_map
@@ -285,7 +300,7 @@ def visualize_splits_tsne(train_sequences=None, val_sequences=None, test_sequenc
     fig_individual.savefig(savepath)
     log_update(f"\tSaved figure to {savepath}")
-def visualize_splits_shannon_entropy(train_sequences=None, val_sequences=None, test_sequences=None, colormap=None, savepath='splits/shannon_entropy_plot.png',axes=None):
     set_font()
     """
     Generate Shannon entropy plots for train, validation, and test sets.
@@ -316,31 +331,52 @@ def visualize_splits_shannon_entropy(train_sequences=None, val_sequences=None, t
     for ax in axes_list:
         ax[0].hist(train_entropy, bins=20, edgecolor='k', color=colormap['train'])
-        ax[0].set_title(f'Train Set (n={len(train_entropy)})')
-        ax[0].set_xlabel('Shannon Entropy')
-        ax[0].set_ylabel('Frequency')
         ax[0].grid(True)
         ax[0].set_axisbelow(True)
         if not(val_plot_index is None):
             ax[val_plot_index].hist(val_entropy, bins=20, edgecolor='k', color=colormap['val'])
-            ax[val_plot_index].set_title(f'Validation Set (n={len(val_entropy)})')
-            ax[val_plot_index].set_xlabel('Shannon Entropy')
             ax[val_plot_index].grid(True)
             ax[val_plot_index].set_axisbelow(True)
         if not(test_plot_index is None):
             ax[test_plot_index].hist(test_entropy, bins=20, edgecolor='k', color=colormap['test'])
-            ax[test_plot_index].set_title(f'Test Set (n={len(test_entropy)})')
-            ax[test_plot_index].set_xlabel('Shannon Entropy')
             ax[test_plot_index].grid(True)
             ax[test_plot_index].set_axisbelow(True)
     fig_individual.set_tight_layout(True)
     fig_individual.savefig(savepath)
     log_update(f"\tSaved figure to {savepath}")
-def visualize_splits_aa_composition(train_sequences=None, val_sequences=None, test_sequences=None, colormap=None, savepath='splits/aa_comp.png',axes=None):
     set_font()
     if colormap is None: colormap=default_color_map
@@ -365,13 +401,17 @@ def visualize_splits_aa_composition(train_sequences=None, val_sequences=None, te
     if (val_sequences is None) and not(test_sequences is None):
         comp_df = pd.DataFrame([train_comp, test_comp], index=['train', 'test']).T
     colors = [colormap[col] for col in comp_df.columns]
     # Plotting
     for ax in axes_list:
         comp_df.plot(kind='bar', color=colors, ax=ax)
-        ax.set_title('Amino Acid Composition Across Datasets')
-        ax.set_xlabel('Amino Acid')
-        ax.set_ylabel('Relative Frequency')
     fig_individual.set_tight_layout(True)
     fig_individual.savefig(savepath)
@@ -379,6 +419,7 @@ def visualize_splits_aa_composition(train_sequences=None, val_sequences=None, te
 ### Outer methods for visualizing splits
 def visualize_splits(train_clusters=None, val_clusters=None, test_clusters=None, benchmark_cluster_reps=None, train_color='#0072B2',val_color='#009E73',test_color='#E69F00',esm_embeddings_path=None, onehot_embeddings_path=None):
     colormap = {
         'train': train_color,
         'val': val_color,
@@ -413,6 +454,14 @@ def visualize_train_val_test_splits(train_clusters, val_clusters, test_clusters,
     val_sequences = val_clusters['member seq'].tolist()
     test_sequences = test_clusters['member seq'].tolist()
     # Create a combined figure with 3 rows and 3 columns
     set_font()
     fig_combined, axs = plt.subplots(3, 3, figsize=(24, 18))
@@ -445,8 +494,9 @@ def visualize_train_val_test_splits(train_clusters, val_clusters, test_clusters,
         axs[2, 2].axis('off')
     plt.tight_layout()
-    fig_combined.savefig('splits/combined_plot.png')
-    log_update(f"\nSaved combined figure to splits/combined_plot.png")
 def visualize_train_test_splits(train_clusters, test_clusters,  benchmark_cluster_reps=None, colormap=None, esm_embeddings_path=None, onehot_embeddings_path=None):
     if colormap is None: colormap=default_color_map
@@ -493,8 +543,8 @@ def visualize_train_test_splits(train_clusters, test_clusters,  benchmark_cluste
                                     colormap=colormap, axes=axs[2, 1])
     plt.tight_layout()
-    fig_combined.savefig('splits/combined_plot.png')
-    log_update(f"\nSaved combined figure to splits/combined_plot.png")
 def visualize_train_val_splits(train_clusters, val_clusters, benchmark_cluster_reps=None, colormap=None, esm_embeddings_path=None, onehot_embeddings_path=None):
     if colormap is None: colormap=default_color_map
@@ -541,5 +591,5 @@ def visualize_train_val_splits(train_clusters, val_clusters, benchmark_cluster_r
                                     colormap=colormap, axes=axs[2, 1])
     plt.tight_layout()
-    fig_combined.savefig('splits/combined_plot.png')
-    log_update(f"\nSaved combined figure to splits/combined_plot.png")

     # Set the font family globally to Ubuntu
     plt.rcParams['font.family'] = regular_font.get_name()
+    # Set the font family globally to Ubuntu
+    plt.rcParams['font.family'] = regular_font.get_name()
     plt.rcParams['mathtext.fontset'] = 'custom'
     plt.rcParams['mathtext.rm'] = regular_font.get_name()
+    plt.rcParams['mathtext.it'] = italic_font.get_name()
+    plt.rcParams['mathtext.bf'] = bold_font.get_name()
 global default_color_map
 default_color_map = {
     counts = [sequence.count(base) for base in bases]
     return entropy(counts, base=2)
+def visualize_splits_hist(train_lengths=None, val_lengths=None, test_lengths=None, colormap=None, savepath=f'splits/split_vis/length_distributions.png', axes=None):
     """
     Works to plot train, val, test; train, val; or train, test
     """
         total_plots-=1
     # Create a figure and axes with 1 row and 3 columns
+    fig_individual, axes_individual = plt.subplots(1, total_plots, figsize=(8*total_plots, 8))
     # Set axes list
     axes_list = [axes_individual] if axes is None else [axes_individual, axes]
     for cur_axes in axes_list:
         # Plot the first histogram
         cur_axes[0].hist(train_lengths, bins=20, edgecolor='k',color=colormap['train'])
+        cur_axes[0].set_xlabel(xlabel, fontsize=24)
+        cur_axes[0].set_ylabel(ylabel, fontsize=24)
+        cur_axes[0].set_title(f'Train Set Length Distribution (n={len(train_lengths)})', fontsize=24)
         cur_axes[0].grid(True)
         cur_axes[0].set_axisbelow(True)
+        cur_axes[0].tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
+        cur_axes[0].tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
         # Plot the second histogram
         if not(val_plot_index is None):
             cur_axes[val_plot_index].hist(val_lengths, bins=20, edgecolor='k',color=colormap['val'])
+            cur_axes[val_plot_index].set_xlabel(xlabel, fontsize=24)
+            cur_axes[val_plot_index].set_ylabel(ylabel, fontsize=24)
+            cur_axes[val_plot_index].set_title(f'Validation Set Length Distribution (n={len(val_lengths)})', fontsize=24)
             cur_axes[val_plot_index].grid(True)
             cur_axes[val_plot_index].set_axisbelow(True)
+            cur_axes[val_plot_index].tick_params(axis='x', labelsize=24)
+            cur_axes[val_plot_index].tick_params(axis='y', labelsize=24)
         # Plot the third histogram
         if not(test_plot_index is None):
             cur_axes[test_plot_index].hist(test_lengths, bins=20, edgecolor='k',color=colormap['test'])
+            cur_axes[test_plot_index].set_xlabel(xlabel, fontsize=24)
+            cur_axes[test_plot_index].set_ylabel(ylabel, fontsize=24)
+            cur_axes[test_plot_index].set_title(f'Test Set Length Distribution (n={len(test_lengths)})', fontsize=24)
             cur_axes[test_plot_index].grid(True)
             cur_axes[test_plot_index].set_axisbelow(True)
+            cur_axes[test_plot_index].tick_params(axis='x', labelsize=24)
+            cur_axes[test_plot_index].tick_params(axis='y', labelsize=24)
     # Adjust layout
     fig_individual.set_tight_layout(True)
     fig_individual.savefig(savepath)
     log_update(f"\tSaved figure to {savepath}")
+def visualize_splits_scatter(train_clusters=None, val_clusters=None, test_clusters=None, benchmark_cluster_reps=None, colormap=None, savepath='splits/split_vis/scatterplot.png', axes=None):
     set_font()
     if colormap is None: colormap=default_color_map
     # Specially mark the benchmark clusters because these can't be reallocated
     for ax in axes_list:
         ax.plot(train_clustersgb['cluster size (n_members)'],train_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['train'],label='train')
+        train_clustersgb.to_csv(savepath.replace(".png","_train_source_data.csv"),index=False)
         if not(val_clusters is None):
             ax.plot(val_clustersgb['cluster size (n_members)'],val_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['val'],label='val')
+            val_clustersgb.to_csv(savepath.replace(".png","_val_source_data.csv"),index=False)
         if not(test_clusters is None):
             ax.plot(test_clustersgb['cluster size (n_members)'],test_clustersgb['percent_proteins'],linestyle='None',marker='.',color=colormap['test'],label='test')
+            test_clustersgb.to_csv(savepath.replace(".png","_test_source_data.csv"),index=False)
         if not(benchmark_cluster_reps is None):
             ax.plot(benchmark_clustersgb['cluster size (n_members)'],benchmark_clustersgb['percent_proteins'],
                 marker='o',
                 markeredgewidth=1.5,
                 label='benchmark'
             )
+            benchmark_clustersgb.to_csv(savepath.replace(".png","_benchmark_source_data.csv"),index=False)
+        ax.set_ylabel('Percentage of Proteins in Dataset', fontsize=24)
+        ax.set_xlabel('Cluster Size', fontsize=24)
+        ax.tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
+        ax.tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
+        ax.legend(fontsize=24,markerscale=4)
     # save the figure
     fig_individual.set_tight_layout(True)
     log_update(f"\tSaved figure to {savepath}")
+def visualize_splits_tsne(train_sequences=None, val_sequences=None, test_sequences=None, colormap=None, esm_type="esm2_t33_650M_UR50D", embedding_path="fuson_db_embeddings/fuson_db_esm2_t33_650M_UR50D_avg_embeddings.pkl", savepath='splits/split_vis/tsne_plot.png',axes=None):
     set_font()
     if colormap is None: colormap=default_color_map
     fig_individual.savefig(savepath)
     log_update(f"\tSaved figure to {savepath}")
+def visualize_splits_shannon_entropy(train_sequences=None, val_sequences=None, test_sequences=None, colormap=None, savepath='splits/split_vis/shannon_entropy_plot.png',axes=None):
     set_font()
     """
     Generate Shannon entropy plots for train, validation, and test sets.
     for ax in axes_list:
         ax[0].hist(train_entropy, bins=20, edgecolor='k', color=colormap['train'])
+        ax[0].set_title(f'Train Set (n={len(train_entropy)})', fontsize=24)
+        ax[0].set_xlabel('Shannon Entropy', fontsize=24)
+        ax[0].set_ylabel('Frequency', fontsize=24)
         ax[0].grid(True)
         ax[0].set_axisbelow(True)
+        ax[0].tick_params(axis='x', labelsize=24)  # use the loop variable; `axes` is None when called standalone
+        ax[0].tick_params(axis='y', labelsize=24)
+        train_shannon_source_data = pd.DataFrame(data={
+            'sequence': train_sequences, 'shannon_entropy': train_entropy
+        })
+        train_shannon_source_data.to_csv(savepath.replace(".png","_train_source_data.csv"),index=False)
         if not(val_plot_index is None):
             ax[val_plot_index].hist(val_entropy, bins=20, edgecolor='k', color=colormap['val'])
+            ax[val_plot_index].set_title(f'Validation Set (n={len(val_entropy)})', fontsize=24)
+            ax[val_plot_index].set_xlabel('Shannon Entropy', fontsize=24)
             ax[val_plot_index].grid(True)
             ax[val_plot_index].set_axisbelow(True)
+            ax[val_plot_index].tick_params(axis='x', labelsize=24)
+            ax[val_plot_index].tick_params(axis='y', labelsize=24)
+            val_shannon_source_data = pd.DataFrame(data={
+            'sequence': val_sequences, 'shannon_entropy': val_entropy
+            })
+            val_shannon_source_data.to_csv(savepath.replace(".png","_val_source_data.csv"),index=False)
         if not(test_plot_index is None):
             ax[test_plot_index].hist(test_entropy, bins=20, edgecolor='k', color=colormap['test'])
+            ax[test_plot_index].set_title(f'Test Set (n={len(test_entropy)})', fontsize=24)
+            ax[test_plot_index].set_xlabel('Shannon Entropy', fontsize=24)
             ax[test_plot_index].grid(True)
             ax[test_plot_index].set_axisbelow(True)
+            ax[test_plot_index].tick_params(axis='x', labelsize=24)
+            ax[test_plot_index].tick_params(axis='y', labelsize=24)
+            test_shannon_source_data = pd.DataFrame(data={
+            'sequence': test_sequences, 'shannon_entropy': test_entropy
+            })
+            test_shannon_source_data.to_csv(savepath.replace(".png","_test_source_data.csv"),index=False)
     fig_individual.set_tight_layout(True)
     fig_individual.savefig(savepath)
     log_update(f"\tSaved figure to {savepath}")
+def visualize_splits_aa_composition(train_sequences=None, val_sequences=None, test_sequences=None, colormap=None, savepath='splits/split_vis/aa_comp.png',axes=None):
     set_font()
     if colormap is None: colormap=default_color_map
     if (val_sequences is None) and not(test_sequences is None):
         comp_df = pd.DataFrame([train_comp, test_comp], index=['train', 'test']).T
     colors = [colormap[col] for col in comp_df.columns]
+    comp_df.to_csv(savepath.replace(".png","_source_data.csv"))
     # Plotting
     for ax in axes_list:
         comp_df.plot(kind='bar', color=colors, ax=ax)
+        ax.set_title('Amino Acid Composition Across Datasets', fontsize=24)
+        ax.set_xlabel('Amino Acid', fontsize=24)
+        ax.set_ylabel('Relative Frequency', fontsize=24)
+        ax.tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
+        ax.tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
+        ax.legend(fontsize=16, markerscale=2)
     fig_individual.set_tight_layout(True)
     fig_individual.savefig(savepath)
 ### Outer methods for visualizing splits
 def visualize_splits(train_clusters=None, val_clusters=None, test_clusters=None, benchmark_cluster_reps=None, train_color='#0072B2',val_color='#009E73',test_color='#E69F00',esm_embeddings_path=None, onehot_embeddings_path=None):
+    os.makedirs("splits/split_vis",exist_ok=True)
     colormap = {
         'train': train_color,
         'val': val_color,
     val_sequences = val_clusters['member seq'].tolist()
     test_sequences = test_clusters['member seq'].tolist()
+    # save length source data
+    train_lengths_source_data=pd.DataFrame(data={"sequence":train_sequences,"length":train_lengths})
+    train_lengths_source_data.to_csv("splits/split_vis/train_lengths_source_data.csv",index=False)
+    val_lengths_source_data=pd.DataFrame(data={"sequence":val_sequences,"length":val_lengths})
+    val_lengths_source_data.to_csv("splits/split_vis/val_lengths_source_data.csv",index=False)
+    test_lengths_source_data=pd.DataFrame(data={"sequence":test_sequences,"length":test_lengths})
+    test_lengths_source_data.to_csv("splits/split_vis/test_lengths_source_data.csv",index=False)
     # Create a combined figure with 3 rows and 3 columns
     set_font()
     fig_combined, axs = plt.subplots(3, 3, figsize=(24, 18))
         axs[2, 2].axis('off')
     plt.tight_layout()
+    fig_combined.set_tight_layout(True)
+    fig_combined.savefig('splits/split_vis/combined_plot.png', bbox_inches="tight")
+    log_update(f"\nSaved combined figure to splits/split_vis/combined_plot.png")
 def visualize_train_test_splits(train_clusters, test_clusters,  benchmark_cluster_reps=None, colormap=None, esm_embeddings_path=None, onehot_embeddings_path=None):
     if colormap is None: colormap=default_color_map
                                     colormap=colormap, axes=axs[2, 1])
     plt.tight_layout()
+    fig_combined.savefig('splits/split_vis/combined_plot.png')
+    log_update(f"\nSaved combined figure to splits/split_vis/combined_plot.png")
 def visualize_train_val_splits(train_clusters, val_clusters, benchmark_cluster_reps=None, colormap=None, esm_embeddings_path=None, onehot_embeddings_path=None):
     if colormap is None: colormap=default_color_map
                                     colormap=colormap, axes=axs[2, 1])
     plt.tight_layout()
+    fig_combined.savefig('splits/split_vis/combined_plot.png')
+    log_update(f"\nSaved combined figure to splits/split_vis/combined_plot.png")
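
The new code in this diff derives each source-data CSV name from the figure path with `savepath.replace(".png", ...)`. That leaves the path unchanged when `savepath` does not end in `.png`, so the CSV would be written to the figure's own path. A possible alternative, sketched here under the assumption that a small helper is acceptable (the name `source_data_path` is hypothetical and does not exist in the repository), builds the name from the file stem instead:

```python
from pathlib import Path


def source_data_path(savepath, tag=""):
    """Return the CSV path that sits next to a figure, whatever its extension.

    `tag` is an optional split label such as "train", "val", or "test".
    """
    figure = Path(savepath)
    suffix = f"_{tag}_source_data.csv" if tag else "_source_data.csv"
    # with_name swaps only the final path component and keeps the directory.
    return figure.with_name(figure.stem + suffix)


# Same names as the files added in this commit.
train_csv = source_data_path("splits/split_vis/scatterplot.png", "train")
aa_csv = source_data_path("splits/split_vis/aa_comp.png")
# A non-PNG figure still gets a distinct CSV path.
pdf_csv = source_data_path("splits/split_vis/scatterplot.pdf", "val")
print(train_csv.as_posix())
```

A call such as `train_clustersgb.to_csv(source_data_path(savepath, "train"), index=False)` would then stand in for the `replace`-based expression.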