Turk-LettuceDetect: Hallucination Detection Models for Turkish RAG Applications

Community Article Published August 29, 2025

Turk-LettuceDetect cover image

While Large Language Models (LLMs) have revolutionized text generation and comprehension, their tendency to "hallucinate"—generating plausible but factually incorrect or fabricated information—remains a significant barrier to their reliability. Retrieval-Augmented Generation (RAG) systems were developed to address this by grounding model responses in external knowledge sources. However, even RAG systems are not entirely immune to hallucinations, especially when dealing with morphologically complex and relatively low-resource languages like Turkish.

To address this challenge, we introduce Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications. By open-sourcing both the models and the translated dataset, this work aims to establish a foundation for developing more reliable and trustworthy AI applications for Turkish and other similar languages.

Quick Links: Hugging Face Models and Dataset

The Hallucination Problem and the Unique Challenges of Turkish

One of the biggest drawbacks of LLMs is their tendency to produce factually incorrect content by over-relying on their parametric knowledge. RAG mitigates this issue by providing the model with a "knowledge base." Before answering a query, the model retrieves relevant documents and bases its response on the information found within them. However, this architecture is not flawless. Hallucinations can still occur for several reasons:

  • Incomplete or contradictory context: The retrieved documents may contain insufficient or conflicting information to answer the query accurately.
  • Misinterpretation: The model might misunderstand the provided context.
  • Linguistic complexity: For languages like Turkish, which have an agglutinative and rich morphology, the complexity of word structures makes hallucination detection more challenging compared to morphologically simpler languages like English.

Current approaches to hallucination detection generally fall into two categories:

  • Prompt-based methods, which use large LLMs as "judges." While flexible, their high computational cost and slow inference make them impractical for many real-world applications.

  • Encoder-based models, which are smaller and specialized. They are more efficient but often struggle to process the long contexts typical of RAG applications.

Turk-LettuceDetect: Our Approach and Methodology

Turk-LettuceDetect fills this gap by building on the LettuceDetect framework. This approach formulates hallucination detection as a token-level classification task: each token in a model-generated response is labeled as either "supported" or "hallucinated" based on the provided context.
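To make this formulation concrete, the sketch below converts character-level hallucination annotations into per-token labels. It uses a simple whitespace tokenizer for clarity; the actual models operate on their own subword tokenizations, and the example answer and spans are invented for illustration.

```python
# Sketch of the token-level formulation: each token in a generated answer
# is labeled "supported" (0) or "hallucinated" (1) based on annotated
# character spans. Whitespace tokenization is used here for simplicity.

def token_labels(answer: str, hallucinated_spans: list[tuple[int, int]]) -> list[int]:
    """Map character-level hallucination spans to per-token 0/1 labels."""
    labels = []
    pos = 0
    for token in answer.split():
        start = answer.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        # A token is hallucinated if it overlaps any annotated span.
        overlaps = any(start < s_end and end > s_start
                       for s_start, s_end in hallucinated_spans)
        labels.append(1 if overlaps else 0)
    return labels

answer = "Anne Frank died in March 1945 in Paris"
spans = [(33, 38)]  # hypothetical annotation: "Paris" is unsupported
print(token_labels(answer, spans))  # only the last token is flagged
```

During training, these per-token labels become the classification targets for the fine-tuned encoders.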

To accomplish this, we fine-tuned three distinct encoder architectures specifically for Turkish:

  • ModernBERT-base-tr: A Turkish-specific ModernBERT model featuring modern enhancements such as Rotary Position Embeddings (RoPE). It can process long contexts of up to 8,192 tokens and was fine-tuned on Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks.

  • TurkEmbed4STS: A Turkish embedding model optimized for semantic textual similarity tasks.

  • EuroBERT: A powerful, pre-existing multilingual model known for its robust cross-lingual capabilities.

The Dataset: Adapting RAGTruth for Turkish

To train our models, we used RAGTruth, the first large-scale benchmark dataset specifically designed to evaluate hallucination phenomena in RAG settings. The dataset contains 17,790 annotated instances across three distinct tasks: question answering, data-to-text generation, and summarization. We machine-translated this originally English dataset into Turkish using the google/gemma-3-27b-it model. During the translation process, we took great care to preserve the structure and position of the tags, which mark the hallucinated spans. This ensured that the model could learn to identify which parts of the text were hallucinations, thereby maintaining the integrity of the dataset's evaluation framework.
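One practical way to enforce this during translation is to check that every span-marking tag survives intact and properly paired. The sketch below assumes `<hal>…</hal>`-style markup around hallucinated spans, which is our illustrative assumption; the actual tag format used in the translation pipeline may differ.

```python
import re

# Hypothetical validation that span-marking tags survive machine
# translation: the translated text must contain the same number of
# tags as the source, and they must be properly paired.

TAG_RE = re.compile(r"</?hal>")

def tags_preserved(source: str, translated: str) -> bool:
    """True if translation keeps the same tag count and valid pairing."""
    src_tags = TAG_RE.findall(source)
    tgt_tags = TAG_RE.findall(translated)
    if len(src_tags) != len(tgt_tags):
        return False
    depth = 0
    for tag in tgt_tags:  # tags must alternate open/close, never nest
        depth += 1 if tag == "<hal>" else -1
        if depth not in (0, 1):
            return False
    return depth == 0

src = "The sisters died <hal>in Paris</hal> in 1945."
tgt = "Kız kardeşler 1945'te <hal>Paris'te</hal> öldü."
bad = "Kız kardeşler 1945'te Paris'te öldü."
print(tags_preserved(src, tgt))  # True: tags survived
print(tags_preserved(src, bad))  # False: tags were dropped
```

Records failing such a check can be re-translated or discarded, keeping the span annotations trustworthy.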

Turk-LettuceDetect: Turkish RAG Hallucination Detection – Prompts (TR)

Example of Translated RAGTruth Data
PROMPT Aşağıdaki haberi 116 kelimeyle özetleyin:
Yetmiş yıl önce Anne Frank, 15 yaşında bir Nazi toplama kampında tifo nedeniyle öldü. 31 Mart 1945'te öldüğüne dair varsayılan tarihten sadece iki hafta sonra, tutulduğu Bergen-Belsen toplama kampı kurtarıldı; bu zamanlama, Yahudi günlük yazarının Holokost'tan kurtulmaya ne kadar yakın olduğunu gösterdi. Ancak Anne Frank Evi tarafından...
ANSWER Anne Frank Evi, Anne Frank ve ablası Margot'nun daha önce düşünüldüğünden en az bir ay önce öldüğünü açıkladı. Holokost sırasında Nazi toplama kamplarında hapsedilen kız kardeşlerin Mart 1945'te, Bergen-Belsen kampının kurtarılmasından sadece iki hafta önce...
ANNOTATION labels: []
Type: Summary
split: train
language: tr
PROMPT Aşağıdaki haberi 116 kelimeyle özetleyin:
Yetmiş yıl önce Anne Frank, Nazi toplama kampında 15 yaşında tifo nedeniyle öldü. 31 Mart 1945'te öldüğü varsayılan tarihten sadece iki hafta sonra, tutuklu olduğu Bergen-Belsen toplama kampı kurtarıldı; bu zamanlama, Yahudi günlük yazarının Holokost'tan kurtulmaya ne kadar...
ANSWER Anne Frank Evi tarafından yayınlanan yeni bir araştırma, Anne Frank ve ablası Margot'nun daha önce inanıldığından en az bir ay önce öldüğünü ortaya koyuyor. Araştırma, arşivlerin ve hayatta kalanların ifadelerinin yenide...
ANNOTATION labels: [{"start": 545, "end": 596, "label": "Evident Conflict"}, {"start": 824, "end": 906, "label": "Evident Baseless Info"}]
Type: Summary
split: train
language: tr
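To show how such a record is consumed, here is a minimal sketch that recovers the flagged text from the character offsets in the annotation. The field names mirror the example above, but the record contents and offsets are illustrative; the released dataset's exact schema may differ slightly.

```python
# Sketch of reading one translated RAGTruth-style record and extracting
# the annotated hallucination spans from the answer text. The record
# below is an invented example in the same shape as the data above.

record = {
    "answer": ("Anne Frank Evi tarafından yayınlanan yeni bir araştırma, "
               "Anne Frank ve ablası Margot'nun daha önce inanıldığından "
               "en az bir ay önce öldüğünü ortaya koyuyor."),
    "labels": [{"start": 57, "end": 77, "label": "Evident Conflict"}],
    "task_type": "Summary",
    "split": "train",
    "language": "tr",
}

def extract_spans(record: dict) -> list[tuple[str, str]]:
    """Return (label, flagged_text) pairs for each annotated span."""
    answer = record["answer"]
    return [(ann["label"], answer[ann["start"]:ann["end"]])
            for ann in record["labels"]]

for label, text in extract_spans(record):
    print(f"{label}: {text!r}")
```

Because the annotations are plain character offsets, this is also where tag preservation during translation pays off: shifted offsets would silently flag the wrong text.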

Findings and Performance

Our comprehensive experiments demonstrate that the developed models are highly effective at detecting hallucinations in Turkish text.

Performance of Token-Level Hallucination Detection Across Models

Model                         Task           Precision  Recall  F1-Score  AUROC
ModernBERT-base-tr            Summary        0.6935     0.5705  0.6007    0.5705
                              Data2txt       0.7652     0.7182  0.7391    0.7182
                              QA             0.7642     0.7536  0.7588    0.7536
                              Whole Dataset  0.7583     0.7024  0.7266    0.7024
TurkEmbed4STS                 Summary        0.6325     0.5656  0.5862    0.5656
                              Data2txt       0.7397     0.7333  0.7365    0.7333
                              QA             0.7378     0.7382  0.7380    0.7382
                              Whole Dataset  0.7268     0.7014  0.7132    0.7014
lettucedect-210m-eurobert-tr  Summary        0.6465     0.5546  0.5771    0.5546
                              Data2txt       0.7866     0.7218  0.7496    0.7218
                              QA             0.7388     0.7262  0.7323    0.7262
                              Whole Dataset  0.7511     0.6908  0.7163    0.6908

Performance of Example-Level Hallucination Detection Across Models


Key Highlights:

ModernBERT Leads the Pack: The Turkish-specific ModernBERT-base-tr model delivered the best overall performance, achieving an F1-score of 0.7266. It particularly excelled in structured tasks like question answering.

Task-Dependent Performance: Model performance varied depending on the task. Summarization, being inherently more abstract, proved to be the most challenging domain for hallucination detection.

Efficiency and Speed: Our models are significantly more efficient than large LLMs. Training on a single NVIDIA A100 GPU took only 2 hours per model. This enables fast, low-cost hallucination detection in real-time applications.

Long Context Support: Our ModernBERT-based model can process inputs up to 8,192 tokens long, a critical capability for analyzing the extensive reference documents commonly used in RAG systems.
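Example-level detection can be derived from token-level predictions with a simple aggregation rule. The sketch below flags an answer if enough tokens are predicted hallucinated; the `min_tokens` threshold is our illustrative assumption, not necessarily the rule used in the evaluation.

```python
# Aggregating token-level predictions (0 = supported, 1 = hallucinated)
# into an example-level decision. A hypothetical threshold controls how
# many flagged tokens it takes to mark the whole answer.

def example_level(token_preds: list[int], min_tokens: int = 1) -> bool:
    """Flag the whole answer if at least `min_tokens` tokens are
    predicted hallucinated."""
    return sum(token_preds) >= min_tokens

print(example_level([0, 0, 0, 0]))             # False: fully supported
print(example_level([0, 1, 1, 0]))             # True: hallucinated span found
print(example_level([0, 1, 0], min_tokens=2))  # False under a stricter threshold
```

Raising the threshold trades recall for precision, which may be preferable in applications where false alarms are costly.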

Conclusion and Future Vision

Turk-LettuceDetect is a foundational step toward enhancing the reliability of Turkish RAG applications. With this work, we introduce the first open-source hallucination detection models specifically designed for Turkish, demonstrate that modern encoder architectures can effectively handle the morphological complexity of Turkish while achieving high performance, and provide a valuable resource for future research by releasing the translated RAGTruth dataset. These models are an essential tool for developers building Turkish AI applications where accuracy and trustworthiness are paramount, such as in finance, law, and healthcare. We plan to continue improving these models and optimizing them for a wider variety of tasks. You can access our models and dataset on Hugging Face to start building more reliable Turkish AI applications today.

Our Perspective

Detecting hallucinations is a vital component in developing trustworthy AI systems—particularly for morphologically complex, low-resource languages like Turkish. In our work with Turk-LettuceDetect, we show that language-aware, fine-tuned solutions can significantly outperform generic models, offering better precision-recall trade-offs. By applying token-level diagnostics and leveraging compact yet expressive encoder architectures like ModernBERT, we present a solution specifically tailored to the linguistic intricacies of Turkish.

This approach not only addresses a critical blind spot in multilingual AI safety but also lays a scalable groundwork for similar progress in other under-resourced languages. With its combination of contextual sensitivity and efficiency, Turk-LettuceDetect is well-positioned for real-time use in Retrieval-Augmented Generation (RAG) systems—an increasingly important backbone of AI workflows.

Going forward, we see major opportunities in expanding annotated datasets, incorporating human-in-the-loop feedback, and experimenting with hybrid modeling strategies to improve robustness. Our long-term vision is to promote open, collaborative development, enabling inclusive and dependable AI systems across a wide range of languages and domains.

References

[1] Kovács, Á., & Recski, G. (2025). Lettucedetect: A hallucination detection framework for rag applications. arXiv preprint arXiv:2502.17125.

[2] Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., ... & Poli, I. (2024). Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. arXiv preprint arXiv:2412.13663.

[3] artiwise-ai/modernbert-base-tr-uncased · Hugging Face. (2024, December 16). https://huggingface.co/artiwise-ai/modernbert-base-tr-uncased

[4] Boizard, N., Gisserot-Boukhlef, H., Alves, D. M., Martins, A., Hammal, A., Corro, C., ... & Colombo, P. (2025). EuroBERT: scaling multilingual encoders for European languages. arXiv preprint arXiv:2503.05500.

[5] Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., ... & Zhang, M. (2024). mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. arXiv preprint arXiv:2407.19669.
