Integrate Sentence Transformers, prevent manual tokenizer EOS
Hello!
Note: Congratulations on these model releases! Nice to see more strong, reasonably sized embedding models, especially with nice features like MRL. Well done!
Pull Request overview
- Integrate with Sentence Transformers (+ README updated, added Sentence Transformers tag to make this model easier to find)
- Update the `tokenizer.json` `TemplateProcessing` so the EOS is always appended.
- Simplify `modeling_drama.py` `_tokenize`, as the EOS is now handled automatically.
- Rename `self.forward` to `self.encode` in `modeling_drama.py`: this allows ST to work, as it uses its own pooling.
Details
I noticed that you're using the Llama tokenizer, which (in)famously struggles with placing the EOS after the tokenized sequence. This is due to the `TemplateProcessing`, which only contains the BOS and not the EOS. I used Arthur's recommendation here (https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992) to resolve it, i.e. I ran
from tokenizers.processors import ByteLevel, Sequence, TemplateProcessing
from transformers import AutoTokenizer

# Load the tokenizer shipped with this repository
tokenizer = AutoTokenizer.from_pretrained("facebook/drama-base")

bos = tokenizer.bos_token
eos = tokenizer.eos_token
# Rebuild the post-processor so that both BOS and EOS are appended
tokenizer._tokenizer.post_processor = Sequence(
    [
        ByteLevel(add_prefix_space=True, trim_offsets=False, use_regex=True),
        TemplateProcessing(
            single=f"{bos}:0 $A:0 {eos}:0",
            pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
            special_tokens=[
                (f"{bos}", tokenizer.bos_token_id),
                (f"{eos}", tokenizer.eos_token_id),
            ],
        ),
    ]
)
and then saved that tokenizer. In `tokenizer.json`, the only updated lines are these:
...
"post_processor": {
"type": "Sequence",
"processors": [
{
"type": "ByteLevel",
"add_prefix_space": true,
"trim_offsets": false,
"use_regex": true
},
{
"type": "TemplateProcessing",
"single": [
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
+ },
+ {
+ "SpecialToken": {
+ "id": "<|end_of_text|>",
+ "type_id": 0
+ }
}
],
"pair": [
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
},
+ {
+ "SpecialToken": {
+ "id": "<|end_of_text|>",
+ "type_id": 0
+ }
+ },
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 1
}
},
{
"Sequence": {
"id": "B",
"type_id": 1
}
+ },
+ {
+ "SpecialToken": {
+ "id": "<|end_of_text|>",
+ "type_id": 1
+ }
}
],
"special_tokens": {
"<|begin_of_text|>": {
"id": "<|begin_of_text|>",
"ids": [
128000
],
"tokens": [
"<|begin_of_text|>"
]
+ },
+ "<|end_of_text|>": {
+ "id": "<|end_of_text|>",
+ "ids": [
+ 128001
+ ],
+ "tokens": [
+ "<|end_of_text|>"
+ ]
}
}
}
]
},
...
There are also corresponding updates in `special_tokens_map.json`.
This allowed me to simplify the `_tokenize` method in your custom modeling code a lot. It should also be more efficient now.
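For illustration, here is a minimal sketch of what the simplified helper can look like now that the EOS is appended by the tokenizer itself; the exact signature and arguments in `modeling_drama.py` may differ, so treat this as a hypothetical shape rather than the actual code:

```python
# Hypothetical sketch: with the updated post-processor, a plain tokenizer call
# already yields <bos> ... <eos>, so no manual EOS insertion is needed.
def _tokenize(self, tokenizer, texts, max_length):
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```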
I would recommend rerunning your code with this revision to experiment:
import torch
from transformers import AutoTokenizer, AutoModel
queries = [
'What percentage of the Earth\'s atmosphere is oxygen?',
'意大利首都是哪里?',
]
documents = [
"The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
"羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]
model_name = "facebook/drama-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name, revision="refs/pr/1")
model = AutoModel.from_pretrained(model_name, revision="refs/pr/1", trust_remote_code=True).to(device)
query_embs = model.encode_queries(tokenizer, queries)
doc_embs = model.encode_documents(tokenizer, documents)
scores = query_embs @ doc_embs.T
print(scores.tolist())
# Expected output: [[0.5310, 0.0821], [0.1298, 0.6181]]
# An extra test:
tokenized = tokenizer("This is my text")
decoded = tokenizer.decode(tokenized["input_ids"])
print(decoded)
# <|begin_of_text|>This is my text<|end_of_text|>
You'll notice that the results are the same, and that the tokenizer automatically uses the EOS.
Beyond these changes, I added the following Sentence Transformers (ST) files:
- `modules.json`: Required, tells ST which "modules" to use. It uses Transformer, Pooling, and Normalize here (a quick way to inspect these modules is sketched after this list).
- `sentence_bert_config.json`: Optional, gives arguments for the Transformer module, notably the maximum sequence length of 8192.
- `config_sentence_transformers.json`: Optional, stores info about prompts and the default similarity function (cosine similarity; "dot" also works as the embeddings are normalized).
- `1_Pooling/config.json`: Required, gives arguments to the Pooling module, tells it to use Mean pooling.
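For a quick sanity check of this setup, a small sketch (assuming `sentence-transformers` is installed) that loads the model and prints the module stack:

```python
from sentence_transformers import SentenceTransformer

# Loading the model reads modules.json and assembles the module stack
model = SentenceTransformer("facebook/drama-base", trust_remote_code=True)
print(model)
# The printed architecture should show the Transformer, mean Pooling, and
# Normalize modules described above
print(model.max_seq_length)  # expected: 8192
```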
This means that the model is now much easier to use with third-party libraries that integrate with Sentence Transformers, like LangChain, LlamaIndex, Haystack, etc.
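As one example of such an integration, a hedged sketch using LangChain's `HuggingFaceEmbeddings` wrapper, which builds on Sentence Transformers under the hood; the keyword arguments below are assumptions about how you'd wire it up, and the query prompt may still need to be passed explicitly via `encode_kwargs`:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="facebook/drama-base",
    model_kwargs={"trust_remote_code": True},      # forwarded to SentenceTransformer(...)
    encode_kwargs={"normalize_embeddings": True},  # forwarded to model.encode(...)
)
vector = embeddings.embed_query("What percentage of the Earth's atmosphere is oxygen?")
print(len(vector))
```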
- Tom Aarsen
Also, I'd love to see these on MTEB. Note that there's an all-new way for submitting models since the MMTEB release from ~last week, described here: https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md
It consists of a simple PR to https://github.com/embeddings-benchmark/mteb and a PR to https://github.com/embeddings-benchmark/results. The first one should actually be easier with this PR merged, as then you can use the SentenceTransformerLoader.
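If useful, here's a rough sketch (assuming the `mteb` package) of running the model on a benchmark task once it loads as a SentenceTransformer; the task choice here is purely illustrative:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("facebook/drama-base", trust_remote_code=True)

# Any MTEB task(s) work here; STSBenchmark is just an illustrative pick
tasks = mteb.get_tasks(tasks=["STSBenchmark"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/facebook__drama-base")
```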
- Tom Aarsen
Hi @tomaarsen , thank you so much for sending this PR, very neat changes! We'll get back to you when we have a chance to test your PR! 😀
One quick question, since we've not used Sentence Transformers with our model yet:
model = SentenceTransformer("facebook/drama-base", truncate_dim=256, trust_remote_code=True)
How does `SentenceTransformer` handle normalization in this case? Does normalization happen after the truncation?
Big apologies for missing your question until now!
To answer your question:
Sentence Transformers has 2 ways of carrying out normalization:
- A `Normalize` module. This is rather "architectural", i.e. it's one of the modules that make up the model's `forward`.
- A `normalize_embeddings` argument in `model.encode`. This is very much post-processing, and purposefully delayed until almost the end.
The truncation happens as the very first post-processing step, so it lives between those two options.
My proposal here uses the `Normalize` module, so the truncation happens afterwards. Because truncating a unit-length vector leaves it with a norm below 1, users who truncate but don't use `model.encode(..., normalize_embeddings=True)` won't get normalized embeddings, despite the `Normalize` module. I recognize that this is a bit unfortunate.
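A small sketch of that effect, reusing the `truncate_dim=256` setup from your snippet (the exact norm values will vary):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("facebook/drama-base", truncate_dim=256, trust_remote_code=True)
text = ["What percentage of the Earth's atmosphere is oxygen?"]

# The Normalize module runs inside forward, then the embedding is truncated to
# 256 dims, so its norm drops below 1:
emb = model.encode(text, prompt_name="query", convert_to_tensor=True)
print(torch.linalg.vector_norm(emb, dim=-1))  # expected: < 1

# normalize_embeddings=True re-normalizes at the very end, i.e. after truncation:
emb = model.encode(text, prompt_name="query", convert_to_tensor=True, normalize_embeddings=True)
print(torch.linalg.vector_norm(emb, dim=-1))  # expected: ~1
```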
The current `model.similarity` method uses cosine similarity instead of dot product, so this should not affect the similarity computations, but it may make sense to update the model cards with:
from sentence_transformers import SentenceTransformer
queries = [
'What percentage of the Earth\'s atmosphere is oxygen?',
'意大利首都是哪里?',
]
documents = [
"The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
"羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]
model = SentenceTransformer("facebook/drama-base", truncate_dim=256, trust_remote_code=True)
query_embs = model.encode(queries, prompt_name="query", normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)
scores = model.similarity(query_embs, doc_embs)
print(scores.tolist())
# Expected output: [[0.6031, 0.1750], [0.2005, 0.7251]]
It's not crucial, though :)
- Tom Aarsen
@tomaarsen Thank you for the clarification! Yeah, I think it's a good idea to add `normalize_embeddings=True` to the README in this case. 😀