---
tags:
  - sentence-transformers
  - sentence-similarity
  - information-retrieval
  - semantic-search
widget:
  - source_sentence: >-
      Descrivi dettagliatamente il processo chimico e fisico che avviene durante
      la preparazione di un impasto per crostata
    sentences:
      - >-
        ## La Magia Chimica e Fisica nell'Impasto della Crostata: Un Viaggio
        Dagli Ingredienti Secchi al Trionfo del Forno


        La preparazione di una crostata, apparentemente un gesto semplice e
        familiare, cela in realtà un affascinante balletto di reazioni chimiche
        e trasformazioni fisiche...
      - >-
        ## L'Arte Effimera: Creare un Dolce Paesaggio Invernale


        Immergiamoci nel cuore pulsante della pasticceria festiva, dove l'arte
        culinaria si fonde con la creatività artistica...
      - >-
        Le piattaforme di comunicazione digitale, con la loro ubiquità
        crescente, si configurano come un'arma a doppio taglio nel panorama
        sociale contemporaneo...
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# Fine-tuned Qwen3-Embedding for Italian-English Cross-Lingual Semantic Retrieval

This model is a fine-tuned version of Qwen/Qwen3-Embedding-0.6B, optimized for cross-lingual semantic retrieval with particular emphasis on Italian query understanding and multilingual document ranking.

## Model Description

  • Model Type: Dense embedding model for semantic retrieval
  • Base Model: Qwen/Qwen3-Embedding-0.6B
  • Output Dimensionality: 1,024-dimensional dense vectors
  • Maximum Sequence Length: 32,768 tokens
  • Primary Languages: Italian, English
  • Similarity Function: Cosine similarity

## Capabilities

### Cross-Lingual Retrieval

The model demonstrates strong performance in matching Italian queries to English documents and vice versa, and is particularly effective in technical and academic domains.

### Domain Coverage

Trained on diverse knowledge domains including:

  • Medical & Health Sciences: Diagnostic imaging, clinical procedures, medical terminology
  • STEM Fields: Physics, computer science, geology, engineering
  • Professional Domains: Finance, law, agriculture, software development
  • Educational Content: Historical studies, culinary arts, general knowledge

### Query Understanding

Enhanced comprehension of:

  • Conversational and informal query patterns
  • Technical terminology across domains
  • Cross-lingual semantic concepts
  • Complex multi-faceted questions

## Training Data

The model was fine-tuned on a curated corpus of Italian-English cross-lingual data, featuring high-quality triplets designed to capture semantic nuances across multiple domains. The dataset emphasizes:

  • Hard negative mining: Strategic inclusion of semantically related but incorrect documents
  • Cross-lingual alignment: Balanced representation of Italian-English language pairs
  • Domain diversity: Comprehensive coverage of academic, professional, and conversational contexts
  • Quality curation: Manual review and automated filtering for coherence and relevance

## Usage

### Basic Retrieval

```python
from sentence_transformers import SentenceTransformer

# "your-model-name" is a placeholder for the published model ID
model = SentenceTransformer("your-model-name")

# Cross-lingual query-document matching
# Query: "How do you distinguish a strike-slip fault from a normal fault?"
query = "Come si distingue una faglia trascorrente da una normale?"
documents = [
    "Strike-slip faults are characterized by horizontal movement...",
    "Normal faults occur due to extensional stress...",
    "Investment portfolio management strategies..."
]

query_embedding = model.encode(query, prompt="Represent this search query for finding relevant passages: ")
doc_embeddings = model.encode(documents, prompt="Represent this passage for retrieval: ")
similarities = model.similarity(query_embedding, doc_embeddings)
```

### Prompt Templates

The model is optimized for specific prompt templates:

  • Queries: "Represent this search query for finding relevant passages: "
  • Documents: "Represent this passage for retrieval: "
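In sentence-transformers, the `prompt` argument to `encode` prepends the template string to the input text before tokenization. A minimal sketch of how the two templates shape the final model inputs (pure string handling, no model download required):

```python
QUERY_PROMPT = "Represent this search query for finding relevant passages: "
DOC_PROMPT = "Represent this passage for retrieval: "

def with_prompt(text: str, prompt: str) -> str:
    """Mirror what encode(..., prompt=...) does: prepend the template."""
    return prompt + text

query_input = with_prompt(
    "Come si distingue una faglia trascorrente da una normale?", QUERY_PROMPT
)
doc_input = with_prompt(
    "Strike-slip faults are characterized by horizontal movement...", DOC_PROMPT
)

print(query_input.startswith(QUERY_PROMPT))  # True
```

Using the wrong template (or none) at inference time can noticeably degrade retrieval quality, since the model was fine-tuned with these exact prefixes.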

## Applications

  • Cross-lingual information retrieval systems
  • Academic and technical document search
  • Multilingual question-answering platforms
  • Educational content recommendation
  • Professional knowledge base systems

## Limitations

  • Language coverage: Primarily optimized for Italian-English pairs
  • Domain specificity: Performance may vary on highly specialized domains not represented in training
  • Cultural context: Reflects primarily Western/European knowledge perspectives
  • Computational requirements: Dense representations require significant storage for large-scale deployment
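To make the storage point concrete, a back-of-the-envelope sketch for float32 vectors at this model's 1,024 dimensions (raw vectors only, before any index overhead, quantization, or compression):

```python
DIM = 1024          # output dimensionality of this model
FLOAT32_BYTES = 4   # bytes per float32 component

bytes_per_vector = DIM * FLOAT32_BYTES  # 4,096 bytes (4 KiB) per document

def index_size_gib(num_docs: int) -> float:
    """Raw vector storage in GiB for a corpus of num_docs documents."""
    return num_docs * bytes_per_vector / 2**30

print(f"{index_size_gib(10_000_000):.1f} GiB for 10M documents")  # ~38.1 GiB
```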

## Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'architecture': 'Qwen3Model'})
  (1): Pooling({'pooling_mode_lasttoken': True, 'include_prompt': True})
  (2): Normalize()
)
```
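The final `Normalize()` layer L2-normalizes each embedding, so cosine similarity reduces to a plain dot product, which is why dot-product vector indexes work directly with this model's outputs. A quick numerical check of that equivalence (NumPy stands in for the model's pooled outputs here):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 1024))                        # stand-in for pooled embeddings
emb = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # what Normalize() does

dot = emb @ emb.T
norms = np.linalg.norm(emb, axis=1)
cosine = dot / np.outer(norms, norms)

print(np.allclose(dot, cosine))  # True: with unit norms the two are identical
```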

## Citation

```bibtex
@misc{qwen3-italian-retrieval-2024,
  title={Fine-tuned Qwen3-Embedding for Italian-English Cross-Lingual Semantic Retrieval},
  year={2024},
  howpublished={\url{https://huggingface.co/your-model-name}}
}
```

## Acknowledgments

This work builds upon the Qwen3-Embedding architecture and advances in contrastive learning for dense retrieval. We acknowledge the contributions of the Qwen team and the sentence-transformers community.


**License:** Inherits licensing terms from the base Qwen/Qwen3-Embedding-0.6B model.