Embedding exploration

This folder contains all the data and code needed to run embedding exploration (Fig. S3).

Data download

To help select transcription factor (TF)- and kinase-containing fusions for investigation (Fig. S3a), Supplementary Table 3 from Salokas et al. 2020 was downloaded as a reference list of transcription factors and kinases.

benchmarking/
└── embedding_exploration/ 
    └── data/ 
        ├── salokas_2020_tableS3.csv
        ├── tf_and_kinase_fusions.csv
        └── top_genes.csv
  • data/salokas_2020_tableS3.csv: Supplementary Table 3 from Salokas et al. 2020
  • data/tf_and_kinase_fusions.csv: set of TF::TF and Kinase::Kinase fusion oncoproteins from the FusOn-DB database, curated in plot.py
  • data/top_genes.csv: fusion oncoproteins (and their head and tail components) visualized in Fig. S3b. Sequences for head and tail components were pulled from the best-aligned sequences in fuson_plm/data/blast/blast_outputs/best_htg_alignments_swissprot_seqs.pkl

Plotting

Run plot.py to regenerate plots in Figure S3:

# Dictionary: key = run name, values = epochs (use this option if you've trained your own model)
# Or "FusOn-pLM" to use the official model
FUSON_PLM_CKPT = "FusOn-pLM"

# Type of dim reduction
PLOT_UMAP = True
PLOT_TSNE = False

# Overwriting configs
PERMISSION_TO_OVERWRITE = False                     # if False, script will halt if it believes these embeddings have already been made. 
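
The settings above can be read as two decisions: which checkpoints to embed with, and whether existing outputs may be replaced. The sketch below illustrates one way a script could interpret them. The helper names `resolve_checkpoints` and `may_write` are hypothetical and are not taken from plot.py, and the assumption that each dictionary value is a list of epochs is ours.

```python
from pathlib import Path


def resolve_checkpoints(ckpt_config):
    """Normalize the checkpoint setting into (run_name, epoch) pairs.

    Assumption: a plain string names a single model (such as the official
    "FusOn-pLM"), while a dict maps run names to lists of epochs.
    """
    if isinstance(ckpt_config, str):
        # No epoch applies to a single named model.
        return [(ckpt_config, None)]
    pairs = []
    for run_name, epochs in ckpt_config.items():
        for epoch in epochs:
            pairs.append((run_name, epoch))
    return pairs


def may_write(output_path, permission_to_overwrite):
    """Return True if output_path is absent or overwriting is allowed."""
    return permission_to_overwrite or not Path(output_path).exists()
```

With `FUSON_PLM_CKPT = "FusOn-pLM"` the first helper yields a single entry with no epoch, and with `PERMISSION_TO_OVERWRITE = False` the second returns False for any path that already exists.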

To run, use:

nohup python plot.py > plot.out 2> plot.err &
  • All results are stored in embedding_exploration/results/<timestamp>, where <timestamp> is a unique string encoding the date and time when you started the run.
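
As an illustration of how a timestamped results folder can be derived, here is a minimal sketch. The exact timestamp format used by plot.py is not stated here, so the `"%Y-%m-%d_%H-%M-%S"` pattern and the function name `results_dir_for` are assumptions.

```python
from datetime import datetime
from pathlib import Path


def results_dir_for(start_time, base="embedding_exploration/results"):
    """Build a results path named after the time the run started.

    The format string is an assumed example, not the one used by plot.py.
    """
    stamp = start_time.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(base) / stamp


if __name__ == "__main__":
    out_dir = results_dir_for(datetime.now())
    print(out_dir)
```

Because the folder name comes from the start time, runs launched at different times get separate folders.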

Below are the FusOn-pLM paper results in results/final/umap_plots/fuson_plm/best/:

benchmarking/
└── embedding_exploration/ 
    └── results/final/umap_plots/fuson_plm/best/
        ├── favorites/
        │   ├── umap_favorites_source_data.csv
        │   └── umap_favorites_visualization.png
        └── tf_and_kinase/
            ├── umap_tf_and_kinase_fusions_source_data.csv
            └── umap_tf_and_kinase_fusions_visualization.png
  • favorites/umap_favorites_visualization.png: Fig. S3b. The plotted data is stored in favorites/umap_favorites_source_data.csv.
  • tf_and_kinase/umap_tf_and_kinase_fusions_visualization.png: Fig. S3a. The plotted data is stored in tf_and_kinase/umap_tf_and_kinase_fusions_source_data.csv.
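
To check what a source-data file contains before re-plotting it, a small standard-library sketch such as the following may help. It makes no assumption about column names and only reports the header and row count. The function name `summarize_rows` is hypothetical, and the path in the usage section assumes you run it from the embedding_exploration folder.

```python
import csv


def summarize_rows(lines):
    """Return (header, number_of_data_rows) for CSV text given line by line."""
    reader = csv.reader(lines)
    header = next(reader, [])
    n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    path = "results/final/umap_plots/fuson_plm/best/favorites/umap_favorites_source_data.csv"
    try:
        # newline="" is the recommended way to open files for the csv module.
        with open(path, newline="") as handle:
            header, n_rows = summarize_rows(handle)
        print(f"columns: {header}")
        print(f"rows: {n_rows}")
    except FileNotFoundError:
        print(f"not found: {path}")
```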