About tk.tokenize_data
Hi Christina! Thank you for the amazing work. I have two questions.
First, regarding downstream tasks such as cell-type classification, the dataset needs to be split into a training set and a test set for fine-tuning and prediction, respectively. When tokenizing the data with tk.tokenize_data, there are two possible orders of operations:
1. tokenize the whole dataset and then split it;
2. split the dataset first and then tokenize each subset.
Do these two approaches produce different results?
Second, the data distribution of the downstream task may differ from that of the pretraining corpus. Should I replace the tokenizer's gene_median_dictionary with one generated from the downstream dataset, following the instructions in the tutorial "Obtain non-zero median expression value", or should I just use the same tokenizer as in the pretraining procedure?
Thank you for your questions!
There should be no difference between the two methods: each cell is tokenized independently using fixed, precomputed normalization factors, so the order of splitting and tokenizing does not affect the result. You should just be sure to keep your training, validation, and test sets separate to avoid contamination of the training data.
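As a minimal sketch of the tokenize-then-split workflow: the tokenized output is a Hugging Face dataset, so `train_test_split` from the `datasets` library can be used. The file path, split fractions, and seed below are hypothetical placeholders.

```python
from datasets import load_from_disk

# Load the dataset produced by tk.tokenize_data (path is hypothetical).
dataset = load_from_disk("my_data.dataset")

# Split once, with a fixed seed for reproducibility, and keep the
# resulting sets separate from this point onward.
train_test = dataset.train_test_split(test_size=0.2, seed=42, shuffle=True)
train_valid = train_test["train"].train_test_split(test_size=0.1, seed=42)

train_ds = train_valid["train"]  # used for fine-tuning
valid_ds = train_valid["test"]   # used for model selection
test_ds = train_test["test"]     # held out until final evaluation
```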
You should definitely use the provided tokenizer and NOT regenerate the gene_median_dictionary using the task-specific data. Please read the instructions in the tutorial you referenced for more information. This paragraph is relevant to your question:
"If using Geneformer, to ensure consistency of the normalization factor used for each gene for all future datasets, users should use the Geneformer transcriptome tokenizer to tokenize their datasets and should not re-calculate this normalization factor for their individual dataset . This code for re-calculating the normalization factor should only be used by users who are pretraining a new model from scratch with a new pretraining corpus other than Genecorpus-30M."