---
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
---

# gte-micro

This model is a distillation of [gte-small](https://huggingface.co/thenlper/gte-small).

## Intended purpose

This model is designed for use in semantic-autocomplete ([click here for a demo](https://mihaiii.github.io/semantic-autocomplete/)).

## Usage (same as [gte-small](https://huggingface.co/thenlper/gte-small))

Use it in [semantic-autocomplete](https://github.com/Mihaiii/semantic-autocomplete) or in code:

```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average the remaining token embeddings.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("Mihaiii/gte-micro")
model = AutoModel.from_pretrained("Mihaiii/gte-micro")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

Use it with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Mihaiii/gte-micro')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```

### Limitation (same as [gte-small](https://huggingface.co/thenlper/gte-small))

This model handles English texts only, and any lengthy text will be truncated to a maximum of 512 tokens.
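
A quick way to see the truncation limit in practice is to tokenize an over-long input and inspect the resulting shape. This is a minimal sketch; the repeated string below is just an illustrative long input, not part of the model card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Mihaiii/gte-micro")

# A deliberately long, made-up input (far more than 512 tokens).
long_text = "sorting algorithms " * 1000

# With truncation enabled, everything past 512 tokens is dropped.
batch = tokenizer(long_text, max_length=512, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)  # expected: torch.Size([1, 512])
```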