NaverHustQA/viLegal_bi

This is an encoder model for the Vietnamese legal domain. It maps legal queries and contexts to a 768-dimensional dense vector space and can be used for information retrieval.

We use vinai/phobert-base-v2 as the pre-trained backbone.

Usage (HuggingFace Transformers)

You can use the model as shown below (remember to word-segment the inputs first):

from transformers import AutoTokenizer, AutoModel
import torch

# CLS pooling: take the hidden state of the first token as the sentence embedding
def cls_pooling(model_output):
    return model_output['last_hidden_state'][:,0,:]

# Sentences to embed. They must already be word-segmented, e.g. with pyvi, underthesea, or RDRSegment
sentences = ['Uống rượu lái_xe bị phạt bao_nhiêu tiền ?', 'Bao_nhiêu tuổi phải làm CCCD ?', 'Uống rượu lái_xe bị phạt 500,000 đồng .']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('NaverHustQA/viLegal_bi')
model = AutoModel.from_pretrained('NaverHustQA/viLegal_bi')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling.
sentence_embeddings = cls_pooling(model_output)

print("Sentence embeddings:")
print(sentence_embeddings)
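
Since the model is intended for information retrieval, a typical next step is to score contexts against a query embedding. The sketch below is only an illustration: it assumes cosine similarity as the scoring function (the similarity actually used in training is described in the report, not here), and the helper names `cosine_similarity` and `rank_contexts` are hypothetical. It uses plain Python lists and toy 3-dimensional vectors so that it runs without downloading the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)

def rank_contexts(query_vec, context_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, c)) for i, c in enumerate(context_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for the 768-dimensional embeddings (assumption).
query = [1.0, 0.0, 1.0]
contexts = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0]]
ranking = rank_contexts(query, contexts)
print(ranking)
```

With the embeddings computed above, the same helpers could be called as `rank_contexts(sentence_embeddings[0].tolist(), sentence_embeddings[1:].tolist())`, which scores the second and third sentences against the first one.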

Training

Full details of our training methods and datasets are available in our report.

Authors

Le Thanh Huong, Nguyen Nhat Quang.