Is the order of word embeddings in a pretrained model the same as the order of tokens in the dictionary?
Hello.
I loaded the model directly and printed its word embeddings using the following code:
# Load model directly
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("ctheodoris/Geneformer")

# Print the input (word) embedding matrix
model.bert.embeddings.word_embeddings.weight
- I guess this loads the default model, GF-12L-95M-i4096. Could you confirm this?
- Is the order of rows in the word embedding matrix the same as the order of tokens in the "token_dictionary_gc95m.pkl" file?
In other words, is the first row of the word embedding matrix the embedding for token 0 (<pad>), the second row for token 1 (<mask>), the third for token 2 (<cls>), the fourth for token 3 (<eos>), and the fifth for token 4 ("ENSG00000000003")? A quick check along the lines of the sketch below is what I have in mind.
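For example (a rough sketch only; the pickle path is just where I saved the dictionary file locally, and model is the object loaded above):

import pickle

# Load the token dictionary shipped with the 95M models (local path assumed)
with open("token_dictionary_gc95m.pkl", "rb") as f:
    token_dictionary = pickle.load(f)  # maps token string -> integer id

print(token_dictionary["<pad>"])            # expecting 0 if my assumption holds
print(token_dictionary["ENSG00000000003"])  # expecting 4

# If rows follow dictionary order, this should be the embedding for that gene
emb_matrix = model.bert.embeddings.word_embeddings.weight
print(emb_matrix[token_dictionary["ENSG00000000003"]])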
Thanks for your question. That's correct for your first question: it loads the default model. As for your second question, we do not access the embeddings prior to the model layers, since they have not yet benefited from the model's representation of the data. You can instead extract embeddings from later layers with the embedding extraction code we provide in this repository.
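For illustration only (this is not the repository's extraction code, just a minimal sketch of the idea using the transformers API; the token IDs below are placeholders standing in for output of the repository's tokenizer):

import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("ctheodoris/Geneformer")
model.eval()

# Placeholder input: a batch of one cell's Geneformer token IDs (illustrative values only)
example_ids = torch.tensor([[2, 4, 5, 6]])

with torch.no_grad():
    out = model(input_ids=example_ids, output_hidden_states=True)

# hidden_states[0] is the embedding-layer output (the matrix printed above);
# hidden_states[-2] is the second-to-last encoder layer, which has been
# contextualized by the model's layers.
second_to_last = out.hidden_states[-2]  # shape: (batch, seq_len, hidden_size)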
Thank you for your quick reply!
I'm sorry, but I'm still confused about the word embeddings. I thought those embeddings were the result of pretraining on the Genecorpus data. Are they initial values from before any training?
Also, I tried the embedding extraction code. However, it seemed to retrain the model on my input data rather than keep the pretrained weights. I'm interested in the embeddings from the pretrained model itself, not the results of fine-tuning.
You can follow this example to extract pretrained embeddings; please check the documentation for the different options. You should use the pretrained model, with “Pretrained” as the model type and 0 classes, since it is not a fine-tuned classifier. Setting the embedding layer to -1 will extract embeddings from the second-to-last layer and take advantage of the model weights; 0 corresponds to the last layer. A sketch of these settings is below.
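Something along these lines (a sketch only; the paths are placeholders and you should confirm the exact EmbExtractor arguments against the current documentation):

from geneformer import EmbExtractor

embex = EmbExtractor(model_type="Pretrained",  # pretrained model, not a fine-tuned classifier
                     num_classes=0,            # 0 classes since there is no classification head
                     emb_layer=-1,             # second-to-last layer; 0 would be the last layer
                     max_ncells=1000,
                     forward_batch_size=100,
                     nproc=4)

# Extracts embeddings with the pretrained weights; no training or fine-tuning occurs
embs = embex.extract_embs("/path/to/GF-12L-95M-i4096",        # pretrained model directory
                          "/path/to/tokenized_data.dataset",  # your tokenized input data
                          "/path/to/output_dir",
                          "output_prefix")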