AssertionError when extracting embeddings from input data

#435
by yuexiafx - opened

Hi !
I encountered an error when I tried to reproduce the code using the dataset in the examples, as follows:
from geneformer import EmbExtractor

# initiate EmbExtractor

embex = EmbExtractor(model_type="CellClassifier",
num_classes=3,
filter_data={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]},
max_ncells=1000,
emb_layer=0,
emb_label=["disease","cell_type"],
labels_to_plot=["disease"],
forward_batch_size=200,
nproc=16)

# extracts embedding from input data
# input data is tokenized rank value encodings generated by Geneformer tokenizer (see tokenizing_scRNAseq_data.ipynb)
# example dataset: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset

embs = embex.extract_embs("/Geneformer/fine_tuned_models/gf-6L-30M-i2048_CellClassifier_cardiomyopathies_220224",
"/Geneformer/examples/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset",
"output_directory_embeding/",
"cell_embeding")

=======================================================
ERROR:
File ~/anaconda3/envs/geneformer/lib/python3.11/site-packages/geneformer/emb_extractor.py:79, in get_embs(model, filtered_input_data, emb_mode, layer_to_quant, pad_token_id, forward_batch_size, token_gene_dict, special_token, summary_stat, silent)
76 gene_token_dict = {v: k for k, v in token_gene_dict.items()}
77 cls_token_id = gene_token_dict["<cls>"]
78 assert (
---> 79 filtered_input_data["input_ids"][0][0] == cls_token_id
80 ), "First token is not <cls> token value"
81 elif emb_mode == "cell":
82 if cls_present:

AssertionError: First token is not <cls> token value

Could you tell me what I should do? Thank you so much !

Thank you for your question! It looks like you are using the 30M model and example dataset. In that case, you would need to use the 30M token dictionary. Currently the default is for the 95M model.
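For reference, the assertion that fails compares the first token of each tokenized cell against the `<cls>` id from the token dictionary the extractor loaded. A minimal sketch of the mismatch, using toy dictionaries (the token ids here are illustrative assumptions, not the real dictionary values):

```python
# Toy stand-in for a 95M-style dictionary, which maps token ids to
# gene ids plus special tokens such as <cls> and <eos>.
token_gene_dict_95m = {0: "<pad>", 1: "<cls>", 2: "<eos>", 3: "ENSG00000139618"}

# Data tokenized with the 30M tokenizer starts with a gene token,
# not <cls> (the 30M scheme has no <cls> token at all).
input_ids_30m = [[2, 3]]

# The extractor inverts the dictionary and looks up the <cls> id ...
gene_token_dict = {v: k for k, v in token_gene_dict_95m.items()}
cls_token_id = gene_token_dict["<cls>"]

# ... then asserts that the first token of the first cell is <cls>.
first_token_is_cls = input_ids_30m[0][0] == cls_token_id
print(first_token_is_cls)  # False -> the AssertionError above
```

With the matching dictionary and tokenizer settings, the first token really is `<cls>` and the check passes.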

ctheodoris changed discussion status to closed

Thank you for your kind patience and answer.

Hello, I had the same error. When I updated the EmbExtractor call to point to the 30M token dictionary (see below), I then get the following:

embex = EmbExtractor(model_type="CellClassifier",
num_classes=3,
filter_data={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]},
max_ncells=1000,
emb_layer=0,
emb_label=["disease","cell_type"],
labels_to_plot=["disease"],
forward_batch_size=200,
nproc=16,
token_dictionary_file="/path/to/Genecorpus-30M/token_dictionary.pkl"
)

also tried: token_dictionary_file="/path/to/gene_dictionaries_30m/token_dictionary_gc30M.pkl"

embs = embex.extract_embs("/path/to/fine_tuned_models/gf-6L-30M-i2048_CellClassifier_cardiomyopathies_220224","/path/to/Genecorpus-30M/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset",outdir,"mytest")

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tmp/Geneformer/geneformer/emb_extractor.py", line 614, in extract_embs
embs = get_embs(
File "/tmp/Geneformer/geneformer/emb_extractor.py", line 74, in get_embs
assert cls_present, "<cls> token missing in token dictionary"
AssertionError: <cls> token missing in token dictionary

I must be missing something extremely basic. Any pointers would be appreciated! Thank you!

Thank you for your question! Since you are using the 30M dictionary, you need to set special_token to False, since this is a feature of the new model / dictionary only.

The instructions for how to set up the tokenizer to use the 30M vs 95M dictionary are noted here: https://geneformer.readthedocs.io/en/latest/geneformer.tokenizer.html

Hello,

I am also facing the issue described by ag2022, and have a question on the suggested solution.

To follow the suggested solution of setting the special_token to False, do I need to perform tokenization before I use the embedding extractor?

Looking forward to your reply.

Thanks.

If you are using a new dataset rather than the example one, then yes, you should perform tokenization from your raw counts data. Please follow the instructions in the tokenization example to set it up accordingly for the 30M or 95M model/dictionary, then match those settings for the embedding extraction. The defaults are all for the 95M model currently, so the only scenario in which you need to change these is if you are using the 30M model.
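As a summary of the settings that need to stay consistent between tokenization and embedding extraction, based on this thread and the linked docs (treat these as assumptions to verify against the documentation, not authoritative defaults):

```python
# Settings inferred from this discussion: the current defaults target
# the 95M series; the 30M series needs the older token dictionary,
# no special tokens, and "cell" embedding mode (no <cls> to pool from).
SETTINGS = {
    "95M": {
        "token_dictionary_file": None,  # use the package default
        "special_token": True,
        "emb_mode": "cls",
    },
    "30M": {
        "token_dictionary_file": "token_dictionary_gc30M.pkl",
        "special_token": False,
        "emb_mode": "cell",
    },
}

for series, cfg in SETTINGS.items():
    print(series, cfg)
```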

Hello,

Thanks for your reply. However, I am afraid it does not answer my question. Let me provide a brief summary of what I am trying to do; perhaps that will help clarify the issue I am facing.

Goal:

Execute the example code (https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/extract_and_plot_cell_embeddings.ipynb) as is

Input file used is as suggested in the code:

https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset

Executing the following code block throws error shown below:


Code block:


# initiate EmbExtractor
# OF NOTE: token_dictionary_file must be set to the gc-30M token dictionary if using a 30M series model
# (otherwise the EmbExtractor will use the current default model dictionary)

embex = EmbExtractor(model_type="CellClassifier",
num_classes=3,
filter_data={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]},
max_ncells=1000,
emb_layer=0,
emb_label=["disease","cell_type"],
labels_to_plot=["disease"],
forward_batch_size=200,
nproc=16,
token_dictionary_file="./gene_dictionaries_30m/token_dictionary_gc30M.pkl") # change from current default dictionary for 30M model series

# extracts embedding from input data
# input data is tokenized rank value encodings generated by Geneformer tokenizer (see tokenizing_scRNAseq_data.ipynb)
# example dataset for 30M model series: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset

embs = embex.extract_embs("../fine_tuned_models/gf-6L-30M-i2048_CellClassifier_cardiomyopathies_220224", # example 30M fine-tuned model
"/path/to/human_dcm_hcm_nf.dataset",
"/path/to/output_directory/",
"output_prefix")
------------- END --------------------------


Error:


AssertionError Traceback (most recent call last)
in <cell line: 0>()
18 # input data is tokenized rank value encodings generated by Geneformer tokenizer (see tokenizing_scRNAseq_data.ipynb)
19 # example dataset for 30M model series: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset
---> 20 embs = embex.extract_embs("../fine_tuned_models/gf-6L-30M-i2048_CellClassifier_cardiomyopathies_220224", # example 30M fine-tuned model
21 "/content/drive/MyDrive/project/Experiments/geneformer/input_data/human_dcm_hcm_nf.dataset",
22 "/content/drive/MyDrive/project/Experiments/geneformer/output/",

1 frames
/content/Geneformer/geneformer/emb_extractor.py in extract_embs(self, model_directory, input_data_file, output_directory, output_prefix, output_torch_embs, cell_state)
612 )
613 layer_to_quant = pu.quant_layers(model) + self.emb_layer
--> 614 embs = get_embs(
615 model=model,
616 filtered_input_data=downsampled_data,

/content/Geneformer/geneformer/emb_extractor.py in get_embs(model, filtered_input_data, emb_mode, layer_to_quant, pad_token_id, forward_batch_size, token_gene_dict, special_token, summary_stat, silent)
72 eos_present = any("<eos>" in value for value in token_gene_dict.values())
73 if emb_mode == "cls":
---> 74 assert cls_present, "<cls> token missing in token dictionary"
75 # Check to make sure that the first token of the filtered input data is cls token
76 gene_token_dict = {v: k for k, v in token_gene_dict.items()}

AssertionError: <cls> token missing in token dictionary
------------- END --------------------------

Based on my understanding of the discussion on this thread (https://huggingface.co/ctheodoris/Geneformer/discussions/435), I believe I am using the correct token dictionary (i.e. token_dictionary_gc30M.pkl) suited for the sample input dataset (i.e. human_dcm_hcm_nf.dataset). Please let me know if this is correct, and if so, what is the way to address this error.

Thank you for your patience.

Thank you for the additional information. Please try setting emb_mode to "cell" if you aren't using the new model that has a "<cls>" token.
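Putting the two checks together, a small helper like this (purely illustrative, not part of geneformer) captures the rule from this answer: if the token dictionary has no `<cls>` entry, fall back to `emb_mode="cell"`. The toy dictionaries stand in for the real pickles:

```python
def pick_emb_mode(token_gene_dict):
    """Return "cls" when the dictionary carries the <cls> special token
    (95M-series dictionaries), otherwise "cell" (30M-series)."""
    cls_present = any("<cls>" in value for value in token_gene_dict.values())
    return "cls" if cls_present else "cell"

# Toy dictionaries; real ids differ.
dict_30m = {0: "<pad>", 1: "ENSG00000139618"}
dict_95m = {0: "<pad>", 1: "<cls>", 2: "<eos>", 3: "ENSG00000139618"}

print(pick_emb_mode(dict_30m))  # cell
print(pick_emb_mode(dict_95m))  # cls
```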

(From the documentation):

[screenshot of the EmbExtractor emb_mode documentation]

Your suggestion worked. Thank you for the help.
