ctheodoris/Geneformer · Label mapping for using fine tuned model with in silico perturber?

ag2022

24 days ago

I’m slowly working my way through some smell tests with my own data and have two questions.

1. My understanding of the cosine_shift metric for cell state shifts is that it compares the cosine similarity of the perturbed start state vs. the end state to the cosine similarity of the original start state vs. the end state. Thus, if the perturbation brings the start state closer to the goal end state, the resulting value will be positive, with a larger magnitude meaning a larger shift in the favorable direction. If the perturbation moves the start state away from the goal end state, the result will be negative. Is that correct?
2. I have a few cell states with very specific gene expression, and I’m in silico overexpressing some known key regulators of these states, but the results don’t match my expectations. I’m wondering if I need to adjust something so that the labels match the label mapping used when I prepared my data for fine-tuning, or similar?
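To make the sign convention in the first question concrete, here is a minimal toy sketch. It is not Geneformer's implementation: the helper names `cosine_similarity` and `cosine_shift` and the 2-D vectors are hypothetical, and the shift is assumed to be the simple difference between the two cosine similarities described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_shift(original_start, perturbed_start, goal_end):
    """Assumed definition: cos(perturbed, goal) - cos(original, goal)."""
    return (cosine_similarity(perturbed_start, goal_end)
            - cosine_similarity(original_start, goal_end))


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
goal = [1.0, 0.0]
start = [0.0, 1.0]      # orthogonal to the goal state
toward = [1.0, 1.0]     # perturbation rotates the start state toward the goal
away = [-1.0, 1.0]      # perturbation rotates the start state away from the goal

shift_toward = cosine_shift(start, toward, goal)
shift_away = cosine_shift(start, away, goal)

print(f"toward goal: {shift_toward:+.3f}")  # positive
print(f"away from goal: {shift_away:+.3f}")  # negative
```

Under this assumed definition, a perturbation that moves the start state toward the goal gives a positive value and one that moves it away gives a negative value.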

state_embs_dict = embex.get_state_embs(cell_states_to_model, finetuned_model_checkpoint_best, original_input_dataset, output_dir, outprefix)

The “input_dataset” I’m using is my whole (unprepared) tokenized dataset.

Thanks so much for your help!

ctheodoris

Owner 23 days ago

Thank you for your question! Yes, that is generally true for the cosine_shift values. Results can fail to match expectations for many reasons. First, you would want to ensure that your start and goal states are separable within the embedding space of your fine-tuned model. It is also a useful check to confirm that the predicted genes overlap significantly with the genes that are differentially expressed between the two states. The predictions will not exactly match the differential expression (we have shown previously, for example, that predictions outperform differential expression in concordance with effective perturbations in a CRISPR screen), but the predicted genes should overlap significantly with genes that change in the appropriate direction. If the vast majority of predicted genes were differentially expressed in the direction opposite to the expected one, this would be unexpected and should prompt a closer look at the code and/or data.
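One rough way to run the overlap check described above is a one-sided hypergeometric test. This is only a sketch under stated assumptions: the gene names and sets are made up, `hypergeom_tail` is a hypothetical helper, and it assumes you already have (a) the genes whose in silico overexpression shifts cells toward the goal state and (b) the genes upregulated in the goal state vs. the start state, both drawn from the same background of tested genes.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` genes without replacement
    from `population` genes, of which `successes` are 'hits'."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


# Hypothetical toy data: 20 tested genes named G0..G19.
background = {f"G{i}" for i in range(20)}
# Genes predicted to shift cells toward the goal state when overexpressed.
predicted_toward_goal = {"G0", "G1", "G2", "G3", "G4"}
# Genes upregulated in the goal state vs. the start state (from your DE analysis).
de_up_in_goal = {"G0", "G1", "G2", "G3", "G10"}

overlap = predicted_toward_goal & de_up_in_goal
p_value = hypergeom_tail(
    population=len(background),
    successes=len(de_up_in_goal),
    draws=len(predicted_toward_goal),
    observed=len(overlap),
)

print(f"overlap: {len(overlap)} of {len(predicted_toward_goal)} predicted genes")
print(f"one-sided hypergeometric p-value: {p_value:.4f}")
```

A small p-value here only says the overlap is larger than expected by chance. If instead most predicted genes turn out to be differentially expressed in the opposite direction, that is the signal to recheck the code and data.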

ctheodoris changed discussion status to closed 23 days ago

ag2022

21 days ago

Thanks for your reply! Yes, the start and end states are separate in the embedding space, but let me go investigate the differentially expressed genes a bit! Thank you!!