Hindi Sentence Embeddings Model

This is a custom sentence embedding model trained specifically for Hindi text. It uses a 12-layer transformer with several pooling strategies to produce 768-dimensional semantic representations of Hindi sentences.

Features

Specialized for Hindi language text
Transformer architecture whose attention mechanism uses relative positional encoding
Multiple pooling strategies for enhanced semantic representations
Creates normalized vector representations for semantic similarity
Supports semantic search and text similarity applications

Usage

Installation

pip install torch sentencepiece scikit-learn matplotlib
git lfs install
git clone https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model
cd hindi-embedding-foundational-model

Enhanced RAG System

The repository also includes a RAG (Retrieval-Augmented Generation) system that combines Hindi document retrieval with Unsloth's optimized Llama-3.2-1B-Instruct model for question answering.

Setup and Installation

Install additional dependencies:

pip install unsloth transformers bitsandbytes accelerate langchain langchain-community faiss-cpu

Index your documents:

python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --data_dir ./data --output_dir ./output --index

Run in QA mode with LLM:

python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --output_dir ./output --interactive --qa
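
The retrieve-then-answer flow behind QA mode can be pictured with a minimal sketch. It does not reproduce the internals of hindi-rag-system.py, which are not shown here. The helper name `build_prompt`, the prompt wording, and the hard-coded retrieval results (including their scores) are hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, List


def build_prompt(question: str, retrieved: List[Dict], max_docs: int = 3) -> str:
    """Assemble an LLM prompt from the highest-scoring retrieved passages.

    Hypothetical helper for illustration; not part of hindi-rag-system.py.
    """
    # Keep only the best-scoring passages so the prompt stays short.
    top = sorted(retrieved, key=lambda r: r["score"], reverse=True)[:max_docs]
    # Number each passage so the answer can refer back to its source.
    context = "\n".join(f"[{i + 1}] {r['document']}" for i, r in enumerate(top))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Stand-in retrieval results, shaped like the `search` output used under
# Basic Embedding Usage (dicts with 'score' and 'document').
# The scores are invented for illustration, not measured.
retrieved = [
    {"score": 0.41, "document": "मुंबई भारत का सबसे बड़ा शहर है।"},
    {"score": 0.87, "document": "दिल्ली भारत की राजधानी है।"},
]
prompt = build_prompt("भारत की राजधानी क्या है?", retrieved, max_docs=1)
print(prompt)
```

The resulting string would then be passed to the language model. Limiting the number of passages keeps the prompt within a small model's usable context.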

Basic Embedding Usage

from hindi_embeddings import HindiEmbedder

# Initialize the embedder
model = HindiEmbedder("path/to/hindi-embedding-foundational-model")

# Encode sentences to embeddings
sentences = [
    "मुझे हिंदी भाषा बहुत पसंद है।",
    "मैं हिंदी भाषा सीख रहा हूँ।"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity between sentences
similarity = model.compute_similarity(sentences[0], sentences[1])
print(f"Similarity: {similarity:.4f}")

# Perform semantic search
query = "भारत की राजधानी"
documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।"
]
results = model.search(query, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}")
    print(f"   Document: {result['document']}")

# Visualize embeddings
example_sentences = [
    "मुझे हिंदी में पढ़ना बहुत पसंद है।",
    "आज मौसम बहुत अच्छा है।",
    "भारत एक विशाल देश है।"
]
model.visualize_embeddings(example_sentences)
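
As a rough picture of what a call like `model.search` might do, the sketch below ranks documents by cosine similarity between a query vector and document vectors. This is an assumption about the general technique, not the library's actual implementation: `rank_documents` is a hypothetical helper, and the small hand-written vectors stand in for the output of `model.encode`.

```python
import numpy as np


def rank_documents(query_vec, doc_vecs, documents, top_k=3):
    """Return the top_k documents by cosine similarity (hypothetical helper)."""
    # Normalize defensively so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Sort by descending score and keep the best top_k indices.
    order = np.argsort(-scores)[:top_k]
    return [{"score": float(scores[i]), "document": documents[i]} for i in order]


documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।",
]
# Toy 3-dimensional vectors for illustration; real embeddings have 768 dimensions.
doc_vecs = np.array([[0.9, 0.1, 0.0], [0.6, 0.8, 0.0], [0.0, 0.2, 0.98]])
query_vec = np.array([1.0, 0.0, 0.0])

results = rank_documents(query_vec, doc_vecs, documents, top_k=2)
for rank, result in enumerate(results, start=1):
    print(f"{rank}. Score: {result['score']:.4f}  Document: {result['document']}")
```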

Model Details

The model uses a transformer-based architecture with the following enhancements:

Pre-layer normalization for stable training
Specialized attention mechanism with relative positional encoding
Multiple pooling strategies (weighted, mean, attention-based)
L2-normalized output vectors, so cosine similarity reduces to a dot product
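
To make the pooling and normalization steps concrete, here is a minimal NumPy sketch of masked mean pooling followed by L2 normalization. It illustrates the general technique only. The model's own weighted and attention-based pooling layers are not documented here, and the function names and toy tensor shapes are assumptions.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts


def l2_normalize(vectors):
    """Scale each row to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


# Toy input: 2 sentences, 5 token positions, 8 dimensions (the model uses 768).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 8))
mask = np.array([[1, 1, 1, 0, 0],   # first sentence has 3 real tokens
                 [1, 1, 1, 1, 1]])  # second sentence has 5

sentence_vecs = l2_normalize(mean_pool(tokens, mask))
# With unit-length rows, the dot product is the cosine similarity.
cosine = float(sentence_vecs[0] @ sentence_vecs[1])
print(sentence_vecs.shape, round(cosine, 4))
```

Masking matters because padded positions would otherwise pull the average toward whatever values the padding embeddings hold.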

Technical specifications:

Embedding dimension: 768
Hidden dimension: 768
Layers: 12
Attention heads: 12
Vocabulary size: 50,000
Context length: 128 tokens
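
With a context length of 128 tokens, longer documents presumably need to be split before encoding, for example when indexing for retrieval. The sketch below shows one common approach, overlapping fixed-size windows. It uses whitespace-separated words as a crude stand-in for the model's SentencePiece tokens, and `chunk_tokens` and the overlap of 16 are hypothetical choices, not part of this repository. How the model itself handles over-length input is not stated here.

```python
def chunk_tokens(tokens, max_len=128, overlap=16):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some context in the next window.
    Hypothetical helper for illustration.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Whitespace words stand in for SentencePiece tokens (an assumption).
words = ("भारत एक विशाल देश है। " * 60).split()
chunks = chunk_tokens(words)
print(len(words), [len(c) for c in chunks])
```

In practice the split would be done on the tokenizer's own token IDs, since one word can map to several subword tokens.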

Applications

Semantic search and information retrieval
Text clustering and categorization
Recommendation systems
Question answering
Document similarity comparison
Content-based filtering
RAG systems for Hindi language content

License

This model is released under the MIT License.

Citation

If you use this model in your research or application, please cite us:

@misc{DeepMostInnovations2025hindi,
  author = {DeepMost Innovations},
  title = {Hindi Sentence Embeddings Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model}}
}