# Embedding exploration

This folder contains all the data and code needed to run embedding exploration (Fig. S3).

### Data download

To help select TF (transcription factor)- and kinase-containing fusions for investigation (Fig. S3a), Supplementary Table 3 from [Salokas et al. 2020](https://doi.org/10.1038/s41598-020-71040-8) was downloaded as a reference list of transcription factors and kinases.

```
benchmarking/
└── embedding_exploration/
    └── data/
        ├── salokas_2020_tableS3.csv
        ├── tf_and_kinase_fusions.csv
        ├── top_genes.csv
```

- **`data/salokas_2020_tableS3.csv`**: Supplementary Table 3 from [Salokas et al. 2020](https://doi.org/10.1038/s41598-020-71040-8).
- **`data/tf_and_kinase_fusions.csv`**: set of TF::TF and Kinase::Kinase fusion oncoproteins from the FusOn-DB database, curated in `plot.py`.
- **`data/top_genes.csv`**: fusion oncoproteins (and their head and tail components) visualized in Fig. S3b. Sequences for the head and tail components were pulled from the best-aligned sequences in `fuson_plm/data/blast/blast_outputs/best_htg_alignments_swissprot_seqs.pkl`.

### Plotting

Run `plot.py` to regenerate the plots in Fig. S3. The run is controlled by the following configuration options:

```
# Dictionary: key = run name, values = epochs. (use this option if you've trained your own model)
#
# Or "FusOn-pLM" to use official model
FUSON_PLM_CKPT = "FusOn-pLM"

# Type of dim reduction
PLOT_UMAP = True
PLOT_TSNE = False

# Overwriting configs
PERMISSION_TO_OVERWRITE = False # if False, script will halt if it believes these embeddings have already been made.
```

To run, use:

```
nohup python plot.py > plot.out 2> plot.err &
```

- All **results** are stored in a timestamped subfolder of `embedding_exploration/results/`, where `timestamp` is a unique string encoding the date and time when you started training.
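
The timestamped results folder and the `PERMISSION_TO_OVERWRITE` flag described above can be pictured with a minimal sketch. This is not the logic in `plot.py`: the function names `make_timestamp` and `prepare_output_dir`, the timestamp format, and the "non-empty folder" check are all assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path


def make_timestamp(now=None):
    """Return a sortable date-and-time string (hypothetical format)."""
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d_%H-%M-%S")


def prepare_output_dir(results_root, run_name, permission_to_overwrite=False):
    """Create results_root/run_name (hypothetical helper).

    Refuses to reuse a folder that already contains files unless
    permission_to_overwrite is True, mimicking an overwrite guard.
    """
    out_dir = Path(results_root) / run_name
    already_populated = out_dir.exists() and any(out_dir.iterdir())
    if already_populated and not permission_to_overwrite:
        raise FileExistsError(
            f"{out_dir} already contains results; "
            "set permission_to_overwrite=True to reuse it"
        )
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Under these assumptions, `prepare_output_dir("results", make_timestamp())` would create a fresh folder for each run, while calling it twice with the same name and the default `permission_to_overwrite=False` would stop with an error once the folder holds files.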

Below are the FusOn-pLM paper results in `results/final/umap_plots/fuson_plm/best/`:

```
benchmarking/
└── embedding_exploration/
    └── results/final/umap_plots/fuson_plm/best/
        └── favorites/
            ├── umap_favorites_source_data.csv
            ├── umap_favorites_visualization.png
        └── tf_and_kinase/
            ├── umap_tf_and_kinase_fusions_source_data.csv
            ├── umap_tf_and_kinase_fusions_visualization.png
```

- **`favorites/umap_favorites_visualization.png`**: Fig. S3b, with the data directly plotted stored in `favorites/umap_favorites_source_data.csv`.
- **`tf_and_kinase/umap_tf_and_kinase_fusions_visualization.png`**: Fig. S3a, with the data directly plotted stored in `tf_and_kinase/umap_tf_and_kinase_fusions_source_data.csv`.
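
To check what a source-data CSV contains before re-plotting it, a small standard-library sketch like the one below may help. The helper `summarize_source_data` is hypothetical, and no column names are assumed: the function reports whatever header the file has, and grouping is only done if you pass a column name yourself.

```python
import csv
from collections import Counter


def summarize_source_data(csv_path, group_column=None):
    """Return (column names, row count, per-group counts) for a CSV file.

    group_column is optional and user-supplied; it must match a header
    in the file, since this sketch does not assume any column names.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        rows = list(reader)
    if group_column is not None and group_column not in columns:
        raise KeyError(f"{group_column!r} is not a column in {csv_path}")
    groups = Counter(row[group_column] for row in rows) if group_column else Counter()
    return columns, len(rows), groups
```

For example, `summarize_source_data("results/final/umap_plots/fuson_plm/best/favorites/umap_favorites_source_data.csv")` would list the columns and the number of plotted points in the Fig. S3b source data, run from the `embedding_exploration/` folder.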