---
language: en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
pipeline_tag: sentence-similarity
---
DistilBERT encoder model trained on the European law document tagging dataset (EURLex-4K) using [DEXML with cross-batch mix negative sampling](https://github.com/thekop69/two-tower-dissertation), originally adapted from the [Dual Encoder for eXtreme Multi-Label classification (ICLR'24)](https://arxiv.org/pdf/2310.10636v2.pdf) method.

## Inference Usage (Sentence-Transformers)
With `sentence-transformers` installed, you can use this model as follows:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('quicktensor/dexml_eurlex-4k')
embeddings = model.encode(sentences)
print(embeddings)
```
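
Since the model is intended for sentence similarity, you can score the encoded sentences against each other. A minimal sketch using the `util.cos_sim` helper from `sentence-transformers` (the sentences above are placeholders):

```python
from sentence_transformers import util

# Pairwise cosine similarity between the encoded sentences
cosine_scores = util.cos_sim(embeddings, embeddings)
print(cosine_scores)
```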

## Usage (HuggingFace Transformers)
With Hugging Face `transformers`, you only need to be a bit careful about how you pool the transformer output to get the embedding. You can use this model as follows:
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Take the [CLS] token embedding and L2-normalize it
pooler = lambda output: F.normalize(output.last_hidden_state[:, 0, :], dim=-1)

sentences = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained('quicktensor/dexml_eurlex-4k')
model = AutoModel.from_pretrained('quicktensor/dexml_eurlex-4k')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    embeddings = pooler(model(**encoded_input))
print(embeddings)
```
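
Because the pooler L2-normalizes the [CLS] embeddings, cosine similarity between sentences reduces to a dot product. A minimal sketch reusing the `embeddings` tensor from above:

```python
# Pairwise cosine similarity; valid because the embeddings are unit-normalized
similarity = embeddings @ embeddings.T
print(similarity)
```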

## Cite the original authors
If you found this model helpful, please cite the original work as:
```bib
@InProceedings{DEXML,
  author    = "Gupta, N. and Khatri, D. and Rawat, A-S. and Bhojanapalli, S. and Jain, P. and Dhillon, I.",
  title     = "Dual-encoders for Extreme Multi-label Classification",
  booktitle = "International Conference on Learning Representations",
  month     = "May",
  year      = "2024"
}
```