SAP/miCSE

TJKlein committed on Nov 18, 2022

Commit d1858c9 · 1 Parent(s): de95105

Update README.md

Files changed (1)

README.md +32 -4


@@ -10,14 +10,42 @@ Language model of the pre-print arXiv paper titled: "_**miCSE**: Mutual Informat
 The **miCSE** language model is trained for sentence similarity computation. Training the model imposes alignment between the attention pattern of different views (embeddings of augmentations) during contrastive learning. Learning sentence embeddings with **miCSE** entails enforcing the syntactic consistency across augmented views for every single sentence, making contrastive self-supervised learning more sample efficient. Sentence representations correspond to the embedding of the _**[CLS]**_ token.
-# Usage
 ```shell
-tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/<----Enter Model Name---->")
-model = AutoModelWithLMHead.from_pretrained("sap-ai-research/<----Enter Model Name---->")
 ```
 # Benchmark
 Model results on SentEval Benchmark:

 The **miCSE** language model is trained for sentence similarity computation. Training the model imposes alignment between the attention pattern of different views (embeddings of augmentations) during contrastive learning. Learning sentence embeddings with **miCSE** entails enforcing the syntactic consistency across augmented views for every single sentence, making contrastive self-supervised learning more sample efficient. Sentence representations correspond to the embedding of the _**[CLS]**_ token.
+# Model Usage
 ```shell
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
+model = AutoModel.from_pretrained("sap-ai-research/miCSE")
+# Encode the sentences in a list, truncating each to a predefined maximum number of tokens (max_length)
+max_length = 32
+sentences = [
+    "This is a sentence for testing miCSE.",
+    "This is yet another test sentence for the mutual information Contrastive Sentence Embeddings model."
+]
+batch = tokenizer.batch_encode_plus(
+                sentences,
+                return_tensors='pt',
+                padding=True,
+                max_length=max_length,
+                truncation=True
+            )
+# Compute the embeddings
+outputs = model(**batch, output_hidden_states=True, return_dict=True)
+embeddings = outputs.last_hidden_state[:,0]
 ```
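
The model is described as trained for sentence similarity, with each sentence represented by its _**[CLS]**_ embedding, so a natural next step is to compare two embeddings. The sketch below is not part of this commit; it assumes cosine similarity as the comparison measure, which the README does not specify, and uses small hand-written vectors as stand-ins for real model output. The names `cosine_similarity`, `emb_a`, `emb_b`, and `emb_c` are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors given as sequences of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 4-dimensional stand-ins for [CLS] embeddings (illustrative values, not model output).
emb_a = [0.5, 1.0, -0.5, 2.0]
emb_b = [1.0, 2.0, -1.0, 4.0]  # points in the same direction as emb_a
emb_c = [2.0, -1.0, 0.0, 0.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)  # close to 1: same direction
sim_ac = cosine_similarity(emb_a, emb_c)  # close to 0: orthogonal
```

With the tensors produced by the usage snippet, the same comparison could be made by passing `embeddings[0].tolist()` and `embeddings[1].tolist()` to this function, or by staying in PyTorch with `torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)`.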
 # Benchmark
 Model results on SentEval Benchmark: