Integrate Sentence Transformers, prevent manual tokenizer EOS
Hello!
Note: Congratulations on these model releases! Nice to see more strong, reasonably sized embedding models, especially with nice features like MRL. Well done!
Pull Request overview
- Integrate with Sentence Transformers (+ README updated, added Sentence Transformers tag to make this model easier to find)
- Update the `tokenizer.json` `TemplateProcessing` so the EOS is always appended.
- Simplify `modeling_drama.py` `_tokenize`, as the EOS is now handled automatically.
- Rename `self.forward` to `self.encode` in `modeling_drama.py`: this allows ST to work, as it uses its own pooling.
Details
I noticed that you're using the Llama tokenizer, which (in)famously struggles with placing the EOS after the tokenized sequence. This is due to the `TemplateProcessing`, which only contains the BOS and not the EOS. I used Arthur's recommendation here (https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992) to resolve it, i.e. I ran
from tokenizers.processors import ByteLevel, Sequence, TemplateProcessing
from transformers import AutoTokenizer

# Load the tokenizer shipped with this repository
tokenizer = AutoTokenizer.from_pretrained("facebook/drama-base")

bos = tokenizer.bos_token
eos = tokenizer.eos_token
# Rebuild the post-processor so that both BOS and EOS are appended
tokenizer._tokenizer.post_processor = Sequence(
    [
        ByteLevel(add_prefix_space=True, trim_offsets=False, use_regex=True),
        TemplateProcessing(
            single=f"{bos}:0 $A:0 {eos}:0",
            pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
            special_tokens=[
                (f"{bos}", tokenizer.bos_token_id),
                (f"{eos}", tokenizer.eos_token_id),
            ],
        ),
    ]
)
and then saved that tokenizer. In `tokenizer.json`, the only updated lines are these:
...
"post_processor": {
"type": "Sequence",
"processors": [
{
"type": "ByteLevel",
"add_prefix_space": true,
"trim_offsets": false,
"use_regex": true
},
{
"type": "TemplateProcessing",
"single": [
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
+ },
+ {
+ "SpecialToken": {
+ "id": "<|end_of_text|>",
+ "type_id": 0
+ }
}
],
"pair": [
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
},
+ {
+ "SpecialToken": {
+ "id": "<|end_of_text|>",
+ "type_id": 0
+ }
+ },
{
"SpecialToken": {
"id": "<|begin_of_text|>",
"type_id": 1
}
},
{
"Sequence": {
"id": "B",
"type_id": 1
}
+ },
+ {
+ "SpecialToken": {
+ "id": "<|end_of_text|>",
+ "type_id": 1
+ }
}
],
"special_tokens": {
"<|begin_of_text|>": {
"id": "<|begin_of_text|>",
"ids": [
128000
],
"tokens": [
"<|begin_of_text|>"
]
+ },
+ "<|end_of_text|>": {
+ "id": "<|end_of_text|>",
+ "ids": [
+ 128001
+ ],
+ "tokens": [
+ "<|end_of_text|>"
+ ]
}
}
}
]
},
...
There are also corresponding updates in `special_tokens_map.json`.
This allowed me to simplify the `_tokenize` method in your custom modeling code a lot. It should also be more efficient now.
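For illustration, here is a minimal sketch of what the simplified helper can look like now that the EOS is appended by the tokenizer itself; the exact signature and arguments in `modeling_drama.py` may differ, so treat this as a hypothetical shape rather than the actual code:

```python
# Hypothetical sketch: with the updated post-processor, a plain tokenizer call
# already yields <bos> ... <eos>, so no manual EOS insertion is needed.
def _tokenize(self, tokenizer, texts, max_length):
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```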
I would recommend rerunning your code with this revision to experiment:
import torch
from transformers import AutoTokenizer, AutoModel
queries = [
'What percentage of the Earth\'s atmosphere is oxygen?',
'意大利首都是哪里?',
]
documents = [
"The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
"羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]
model_name = "facebook/drama-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name, revision="refs/pr/1")
model = AutoModel.from_pretrained(model_name, revision="refs/pr/1", trust_remote_code=True).to(device)
query_embs = model.encode_queries(tokenizer, queries)
doc_embs = model.encode_documents(tokenizer, documents)
scores = query_embs @ doc_embs.T
print(scores.tolist())
# Expected output: [[0.5310, 0.0821], [0.1298, 0.6181]]
# An extra test:
tokenized = tokenizer("This is my text")
decoded = tokenizer.decode(tokenized["input_ids"])
print(decoded)
# <|begin_of_text|>This is my text<|end_of_text|>
You'll notice that the results are the same, and that the tokenizer automatically uses the EOS.
Beyond these changes, I added the following Sentence Transformers (ST) files:
- `modules.json`: Required, tells ST which "modules" to use. It uses Transformer, Pooling, and Normalize here (a quick way to inspect these modules is sketched after this list).
- `sentence_bert_config.json`: Optional, gives arguments for the Transformer module, notably the maximum sequence length of 8192.
- `config_sentence_transformers.json`: Optional, stores info about prompts and the default similarity function (cosine similarity; "dot" also works as the embeddings are normalized).
- `1_Pooling/config.json`: Required, gives arguments to the Pooling module, tells it to use Mean pooling.
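For a quick sanity check of this setup, a small sketch (assuming `sentence-transformers` is installed) that loads the model and prints the module stack:

```python
from sentence_transformers import SentenceTransformer

# Loading the model reads modules.json and assembles the module stack
model = SentenceTransformer("facebook/drama-base", trust_remote_code=True)
print(model)
# The printed architecture should show the Transformer, mean Pooling, and
# Normalize modules described above
print(model.max_seq_length)  # expected: 8192
```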
This means that the model is now much easier to use with third-party libraries that integrate with Sentence Transformers, like LangChain, LlamaIndex, Haystack, etc.
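As one example of such an integration, a hedged sketch using LangChain's `HuggingFaceEmbeddings` wrapper, which builds on Sentence Transformers under the hood; the keyword arguments below are assumptions about how you'd wire it up, and the query prompt may still need to be passed explicitly via `encode_kwargs`:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="facebook/drama-base",
    model_kwargs={"trust_remote_code": True},      # forwarded to SentenceTransformer(...)
    encode_kwargs={"normalize_embeddings": True},  # forwarded to model.encode(...)
)
vector = embeddings.embed_query("What percentage of the Earth's atmosphere is oxygen?")
print(len(vector))
```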
- Tom Aarsen
Also, I'd love to see these on MTEB. Note that there's an all-new way for submitting models since the MMTEB release from ~last week, described here: https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md
It consists of a simple PR to https://github.com/embeddings-benchmark/mteb and a PR to https://github.com/embeddings-benchmark/results. The first one should actually be easier with this PR merged, as then you can use the SentenceTransformerLoader.
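If useful, here's a rough sketch (assuming the `mteb` package) of running the model on a benchmark task once it loads as a SentenceTransformer; the task choice here is purely illustrative:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("facebook/drama-base", trust_remote_code=True)

# Any MTEB task(s) work here; STSBenchmark is just an illustrative pick
tasks = mteb.get_tasks(tasks=["STSBenchmark"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/facebook__drama-base")
```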
- Tom Aarsen
Hi @tomaarsen , thank you so much for sending this PR, very neat changes! We'll get back to you when we have a chance to test your PR! 😀
One quick question, since we've not used Sentence Transformers with our model yet:
model = SentenceTransformer("facebook/drama-base", truncate_dim=256, trust_remote_code=True)
How does `SentenceTransformer` handle normalization in this case? Does normalization happen after the truncation?
Big apologies for missing your question until now!
To answer your question:
Sentence Transformers has 2 ways of carrying out normalization:
- A `Normalize` module. This is rather "architectural", i.e. it's one of the modules that make up the model's `forward`.
- A `normalize_embeddings` argument in `model.encode`. This is very much post-processing, and purposefully delayed until almost the end.
The truncation happens as the very first post-processing step, so it lives between those two options.
My proposal here uses the `Normalize` module, so the truncation happens afterwards. Because truncating a unit-length vector leaves it with a norm below 1, users who truncate but don't use `model.encode(..., normalize_embeddings=True)` won't get normalized embeddings, despite the `Normalize` module. I recognize that this is a bit unfortunate.
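A small sketch of that effect, reusing the `truncate_dim=256` setup from your snippet (the exact norm values will vary):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("facebook/drama-base", truncate_dim=256, trust_remote_code=True)
text = ["What percentage of the Earth's atmosphere is oxygen?"]

# The Normalize module runs inside forward, then the embedding is truncated to
# 256 dims, so its norm drops below 1:
emb = model.encode(text, prompt_name="query", convert_to_tensor=True)
print(torch.linalg.vector_norm(emb, dim=-1))  # expected: < 1

# normalize_embeddings=True re-normalizes at the very end, i.e. after truncation:
emb = model.encode(text, prompt_name="query", convert_to_tensor=True, normalize_embeddings=True)
print(torch.linalg.vector_norm(emb, dim=-1))  # expected: ~1
```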
The current `model.similarity` method uses cosine similarity instead of dot product, so this should not affect the similarity computations, but it may make sense to update the model cards with:
from sentence_transformers import SentenceTransformer
queries = [
'What percentage of the Earth\'s atmosphere is oxygen?',
'意大利首都是哪里?',
]
documents = [
"The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
"羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]
model = SentenceTransformer("facebook/drama-base", truncate_dim=256, trust_remote_code=True)
query_embs = model.encode(queries, prompt_name="query", normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)
scores = model.similarity(query_embs, doc_embs)
print(scores.tolist())
# Expected output: [[0.6031, 0.1750], [0.2005, 0.7251]]
It's not crucial, though :)
- Tom Aarsen
@tomaarsen Thank you for the clarification! Yeah, I think it's a good idea to add `normalize_embeddings=True` to the README in this case. 😀