Model Card for LuxEmbedder

Model Summary

LuxEmbedder is a sentence-transformers model that transforms sentences and paragraphs into 768-dimensional dense vectors, enabling tasks like clustering and semantic search, with a primary focus on Luxembourgish. Leveraging a cross-lingual approach, LuxEmbedder effectively handles Luxembourgish text while also mapping input from 108 other languages into a shared embedding space. For the full list of supported languages, refer to the sentence-transformers/LaBSE documentation, as LaBSE served as the foundation for LuxEmbedder.

This model was introduced in LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings (Philippy et al., 2024). The paper addresses the scarcity of parallel data for Luxembourgish by creating LuxAlign, a high-quality, human-generated parallel dataset, which forms the basis for LuxEmbedder's competitive performance on cross-lingual and monolingual tasks for Luxembourgish.

With the release of LuxEmbedder, we also provide ParaLux, a Luxembourgish paraphrase detection benchmark, to encourage further exploration and development in NLP for Luxembourgish.

Model type: Sentence Embedding Model
Language(s) (NLP): Luxembourgish + 108 additional languages
License: Creative Commons Attribution Non Commercial 4.0 International (CC BY-NC 4.0)
Architecture: Based on LaBSE
Paper: LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings (Philippy et al., 2024)
Repository: https://github.com/fredxlpy/LuxEmbedder

Example Usage

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util
import numpy as np
import pandas as pd

# Load the model
model = SentenceTransformer('fredxlpy/LuxEmbedder')

# Example sentences
data = pd.DataFrame({
    "id": ["lb1", "lb2", "lb3", "en1", "en2", "en3", "zh1", "zh2", "zh3"],
    "text": [
        "Moien, wéi geet et?",         # Luxembourgish: Hello, how are you?
        "D'Wieder ass haut schéin.",   # Luxembourgish: The weather is beautiful today.
        "Ech schaffen am Büro.",       # Luxembourgish: I work in the office.
        "Hello, how are you?",         
        "The weather is great today.", 
        "I work in an office.",        
        "你好, 你怎么样?",               # Chinese: Hello, how are you?
        "今天天气很好.",                 # Chinese: The weather is very good today.
        "我在办公室工作."                # Chinese: I work in an office.
    ]
})

# Encode the sentences to obtain sentence embeddings
embeddings = model.encode(data["text"].tolist(), convert_to_tensor=True)

# Compute the cosine similarity matrix
cosine_similarity_matrix = util.cos_sim(embeddings, embeddings).cpu().numpy()

# Create a DataFrame for the similarity matrix with "id" as row and column labels
similarity_df = pd.DataFrame(
    np.round(cosine_similarity_matrix, 2),
    index=data["id"],
    columns=data["id"]
)

# Display the similarity matrix
print("Cosine Similarity Matrix:")
print(similarity_df)

# Cosine Similarity Matrix:
# id    lb1   lb2   lb3   en1   en2   en3   zh1   zh2   zh3
# id                                                       
# lb1  1.00  0.60  0.42  0.96  0.59  0.40  0.95  0.62  0.43
# lb2  0.60  1.00  0.41  0.56  0.99  0.39  0.56  0.99  0.42
# lb3  0.42  0.41  1.00  0.44  0.42  0.99  0.46  0.43  0.99
# en1  0.96  0.56  0.44  1.00  0.55  0.43  0.99  0.58  0.46
# en2  0.59  0.99  0.42  0.55  1.00  0.40  0.55  0.99  0.43
# en3  0.40  0.39  0.99  0.43  0.40  1.00  0.44  0.41  0.99
# zh1  0.95  0.56  0.46  0.99  0.55  0.44  1.00  0.58  0.47
# zh2  0.62  0.99  0.43  0.58  0.99  0.41  0.58  1.00  0.44
# zh3  0.43  0.42  0.99  0.46  0.43  0.99  0.47  0.44  1.00
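
The similarity matrix can also be read programmatically, for example to pick the closest English sentence for each Luxembourgish one. The following is a minimal sketch that does not call the model: it reuses the Luxembourgish-vs-English values copied from the printed output above, and the helper name `best_matches` is hypothetical, introduced here only for illustration.

```python
import numpy as np

lb_ids = ["lb1", "lb2", "lb3"]
en_ids = ["en1", "en2", "en3"]

# Luxembourgish rows vs. English columns, copied from the printed matrix above.
lb_en = np.array([
    [0.96, 0.59, 0.40],
    [0.56, 0.99, 0.39],
    [0.44, 0.42, 0.99],
])

def best_matches(similarity, row_ids, col_ids):
    """Hypothetical helper: map each row id to its most similar column id and score."""
    best = similarity.argmax(axis=1)  # index of the highest score in each row
    return {
        row_id: (col_ids[j], float(similarity[i, j]))
        for i, (row_id, j) in enumerate(zip(row_ids, best))
    }

matches = best_matches(lb_en, lb_ids, en_ids)
for lb_id, (en_id, score) in matches.items():
    print(f"{lb_id} -> {en_id} (cosine similarity {score:.2f})")
```

In this small example each Luxembourgish sentence is paired with its English translation, which matches the high diagonal-block values in the matrix.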

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)
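
To make the module list above more concrete, here is a rough NumPy sketch of what modules (1) to (3) do to the transformer output, assuming the usual sentence-transformers semantics for CLS pooling, Dense, and Normalize. The inputs and weights are random placeholders, not the trained LuxEmbedder parameters, and the function name `embed_head` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder inputs and weights (random, NOT the trained LuxEmbedder parameters).
token_embeddings = rng.normal(size=(2, 5, 768))   # (batch, tokens, hidden) from the BertModel
weight = rng.normal(scale=0.02, size=(768, 768))  # Dense: in_features=768, out_features=768
bias = np.zeros(768)

def embed_head(token_embeddings, weight, bias):
    """Sketch of modules (1)-(3): CLS pooling, Dense with Tanh, L2 normalization."""
    cls = token_embeddings[:, 0, :]               # pooling_mode_cls_token: keep the first token
    dense = np.tanh(cls @ weight.T + bias)        # Dense layer with Tanh activation
    norms = np.linalg.norm(dense, axis=1, keepdims=True)
    return dense / norms                          # Normalize(): unit-length vectors

sentence_embeddings = embed_head(token_embeddings, weight, bias)

# For unit-length vectors, the dot product equals the cosine similarity.
dot = sentence_embeddings @ sentence_embeddings.T
print(sentence_embeddings.shape)                  # (2, 768)
print(np.allclose(np.diag(dot), 1.0))             # True
```

Because the final module normalizes every embedding to unit length, a plain dot product between two embeddings gives the same value as their cosine similarity.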

Citation

@misc{philippy2024luxembedder,
      title={LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings}, 
      author={Fred Philippy and Siwen Guo and Jacques Klein and Tegawendé F. Bissyandé},
      year={2024},
      eprint={2412.03331},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.03331}, 
}
