Computing Sentence Embeddings
The basic function to compute sentence embeddings looks like this:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
#Our sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Note: Even though we talk about sentence embeddings, you can also use these models for shorter phrases as well as for longer texts with multiple sentences. See the section on Input Sequence Length for more notes on embeddings for paragraphs.
First, we load a sentence-transformer model:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('model_name_or_path')
You can either specify a pre-trained model name or pass a path on your disk to load the sentence-transformer model from that folder.
If available, the model is automatically executed on the GPU. You can specify the device for the model like this:
model = SentenceTransformer('model_name_or_path', device='cuda')
Here, device can be any PyTorch device (like 'cpu', 'cuda', 'cuda:0', etc.).
The relevant method to encode a set of sentences / texts is model.encode(). Below, you can find the parameters this method accepts. Some relevant parameters are batch_size (depending on your GPU, a different batch size is optimal), as well as convert_to_numpy (returns a numpy matrix) and convert_to_tensor (returns a pytorch tensor).
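As a small sketch, the parameters mentioned above can be passed like this; the batch_size value is only an example and the best setting depends on your hardware:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.']

#Return a numpy matrix (the default)
embeddings_np = model.encode(sentences, batch_size=32, convert_to_numpy=True)

#Return a pytorch tensor instead
embeddings_pt = model.encode(sentences, batch_size=32, convert_to_tensor=True)

print("Numpy shape:", embeddings_np.shape)
print("Tensor shape:", embeddings_pt.shape)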
.. autoclass:: sentence_transformers.SentenceTransformer
   :members: encode
Input Sequence Length
For transformer models like BERT / RoBERTa / DistilBERT etc., the runtime and memory requirements grow quadratically with the input length. This limits transformers to inputs of a certain length. A common value for BERT & Co. is 512 word pieces, which corresponds to about 300-400 words (for English). Texts longer than this are truncated to the first x word pieces.
By default, the provided methods use a limit of 128 word pieces; longer inputs will be truncated. You can get and set the maximal sequence length like this:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Max Sequence Length:", model.max_seq_length)
#Change the length to 200
model.max_seq_length = 200
print("Max Sequence Length:", model.max_seq_length)
Note: You cannot increase the length beyond what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be that good.
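If you want to check the limit of the underlying transformer, the following is a minimal sketch; it assumes the first module of the SentenceTransformer is the Transformer module that exposes the Huggingface model as .auto_model, which may differ for custom architectures:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

#Maximum number of positions supported by the underlying transformer
#(assumes the first module wraps a Huggingface model in .auto_model)
print("Model limit:", model[0].auto_model.config.max_position_embeddings)

#max_seq_length used by sentence-transformers must stay below this limit
print("Current max_seq_length:", model.max_seq_length)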
Storing & Loading Embeddings
The easiest method is to use pickle to store pre-computed embeddings on disk and to load them from disk. This can be especially useful if you need to encode a large set of sentences.
from sentence_transformers import SentenceTransformer
import pickle
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

embeddings = model.encode(sentences)

#Store sentences & embeddings on disc
with open('embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

#Load sentences & embeddings from disc
with open('embeddings.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_sentences = stored_data['sentences']
    stored_embeddings = stored_data['embeddings']
Multi-Process / Multi-GPU Encoding
You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). For an example, see: computing_embeddings_multi_gpu.py.
The relevant method is start_multi_process_pool(), which starts multiple processes that are used for encoding.
.. automethod:: sentence_transformers.SentenceTransformer.start_multi_process_pool
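A minimal sketch of the multi-process workflow could look as follows; it assumes the companion methods encode_multi_process() and stop_multi_process_pool(), as used in the linked example script:

from sentence_transformers import SentenceTransformer

if __name__ == '__main__':
    model = SentenceTransformer('all-MiniLM-L6-v2')
    sentences = ['This framework generates embeddings for each input sentence'] * 1000

    #Start a pool with the available devices (multiple GPUs or CPU processes)
    pool = model.start_multi_process_pool()

    #Sentences are distributed to the processes in the pool
    embeddings = model.encode_multi_process(sentences, pool)
    print("Embeddings computed. Shape:", embeddings.shape)

    #Stop the processes in the pool
    model.stop_multi_process_pool(pool)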
Sentence Embeddings with Transformers
Most of our pre-trained models are based on Hugging Face Transformers and are also hosted in the Hugging Face model repository. It is possible to use our sentence embedding models without installing sentence-transformers:
from transformers import AutoTokenizer, AutoModel
import torch
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
#Sentences we want sentence embeddings for
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']
#Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
You can find the available models here: https://huggingface.co/sentence-transformers
In the above example, we add mean pooling on top of the AutoModel (which loads a BERT model). We also have models that use max pooling and models that use the CLS token. To see how to apply these pooling strategies correctly, have a look at sentence-transformers/bert-base-nli-max-tokens and sentence-transformers/bert-base-nli-cls-token.
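For instance, a sketch of CLS pooling (instead of mean pooling) could look like the following; it assumes the CLS token sits at the first position of the sequence, which holds for BERT-based models:

from transformers import AutoTokenizer, AutoModel
import torch

#CLS Pooling - Take the embedding of the first ([CLS]) token
def cls_pooling(model_output):
    return model_output[0][:, 0]

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-cls-token")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-cls-token")

sentences = ['This framework generates embeddings for each input sentence']

#Tokenize sentences and compute token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

#Perform pooling. In this case, CLS pooling
sentence_embeddings = cls_pooling(model_output)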