NaverHustQA/viLegal_bi
This is an encoder model for the Vietnamese legal domain. It maps legal queries and contexts to a 768-dimensional dense vector space and can be used for information retrieval.
We use vinai/phobert-base-v2 as the pre-trained backbone.
Usage (HuggingFace Transformers)
You can use the model as shown below. Remember to word-segment the inputs first:
from transformers import AutoTokenizer, AutoModel
import torch
# CLS pooling: take the hidden state of the first token as the sentence embedding
def cls_pooling(model_output):
    return model_output['last_hidden_state'][:, 0, :]

# Sentences to embed. They must already be word-segmented, e.g. with pyvi, underthesea, or RDRSegmenter
sentences = ['Uống rượu lái_xe bị phạt bao_nhiêu tiền ?', 'Bao_nhiêu tuổi phải làm CCCD ?', 'Uống rượu lái_xe bị phạt 500,000 đồng .']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('NaverHustQA/viLegal_bi')
model = AutoModel.from_pretrained('NaverHustQA/viLegal_bi')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling.
sentence_embeddings = cls_pooling(model_output)
print("Sentence embeddings:")
print(sentence_embeddings)
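Since the embeddings are meant for information retrieval, a natural next step is to score contexts against a query. The following is a minimal sketch that assumes cosine similarity as the scoring function; this card does not state which similarity measure the model was trained with, so treat that choice as an assumption. The helper names `cosine_similarity` and `rank_contexts` are illustrative and not part of the model or of any library, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors to avoid division by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_contexts(query_vec, context_vecs):
    """Return (index, score) pairs sorted from most to least similar to the query."""
    scores = [(i, cosine_similarity(query_vec, c)) for i, c in enumerate(context_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption: stand-ins for real model embeddings).
query = [1.0, 0.0, 1.0]
contexts = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0]]
ranking = rank_contexts(query, contexts)
print(ranking[0][0])  # index of the best-matching context
```

With the embeddings computed above, one could pass `sentence_embeddings[0].tolist()` as the query and `[sentence_embeddings[2].tolist()]` as the contexts, since the first example sentence is a question and the third is a candidate answer passage.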
Training
Full details of our training methods and datasets are available in our report.
Authors
Le Thanh Huong, Nguyen Nhat Quang.