LLM Evaluation Benchmark for Chinese Language Teaching (CLTE)

CLTE is a benchmark for evaluating how well large language models perform as Chinese language teachers. It consists of three tasks: basic knowledge evaluation, an international teacher examination, and teaching practice evaluation.

Evaluation Framework

GitHub URL: https://github.com/Line-Kite/CLTE

Task Overview

Task 1: Basic Knowledge Evaluation

Objective: Assess foundational knowledge essential for international Chinese education
Coverage: 32 sub-topics across 5 major categories:
- Linguistics (307 questions)
- Chinese Culture (321 questions)
- Pedagogy (163 questions)
- World Culture (192 questions)
- Cross-cultural Communication (217 questions)
Total: 1,200 questions covering the fundamental knowledge base
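
As a quick sanity check on the counts above, the following minimal sketch stores the five category sizes and computes two common aggregate scores. The choice between micro and macro averaging is an assumption made for illustration; this card does not state which one the benchmark reports, and the function names are hypothetical.

```python
# Question counts per category, copied from the list above.
CATEGORY_SIZES = {
    "Linguistics": 307,
    "Chinese Culture": 321,
    "Pedagogy": 163,
    "World Culture": 192,
    "Cross-cultural Communication": 217,
}


def micro_accuracy(correct):
    """Overall accuracy: total correct answers divided by total questions.

    `correct` maps each category name to the number of correct answers.
    """
    total = sum(CATEGORY_SIZES.values())
    return sum(correct[name] for name in CATEGORY_SIZES) / total


def macro_accuracy(correct):
    """Unweighted mean of the per-category accuracies."""
    per_category = [correct[name] / size for name, size in CATEGORY_SIZES.items()]
    return sum(per_category) / len(per_category)


# The five categories add up to the stated total.
assert sum(CATEGORY_SIZES.values()) == 1200
```

Micro averaging weights larger categories such as Chinese Culture more heavily, while macro averaging treats all five categories equally.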

Task 2: International Teacher Examination

Objective: Evaluate comprehensive teaching literacy using authentic certification materials
Data Source: Real-world test questions from official International Chinese Language Teacher Certification exams
Format: Instructional passages accompanied by 2-10 single-choice questions (1,044 total questions)
Focus: Integrated linguistic and pedagogical reasoning in practical teaching scenarios
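
The passage-plus-questions format could be represented roughly as follows. This is only a sketch: the class and field names (PassageItem, Question, prompt, options, answer) are hypothetical and may not match the files in the repository, and question-level accuracy is assumed as the metric.

```python
from dataclasses import dataclass


@dataclass
class Question:
    prompt: str
    options: dict   # hypothetical layout, e.g. {"A": "...", "B": "..."}
    answer: str     # gold option label, e.g. "B"


@dataclass
class PassageItem:
    passage: str
    questions: list

    def __post_init__(self):
        # The card states that each passage carries 2-10 questions.
        if not 2 <= len(self.questions) <= 10:
            raise ValueError("expected 2-10 questions per passage")


def question_accuracy(items, predictions):
    """Fraction of questions answered correctly.

    predictions[i][j] is the predicted label for question j of passage i.
    """
    correct = 0
    total = 0
    for item, preds in zip(items, predictions):
        for question, pred in zip(item.questions, preds):
            correct += int(pred.strip().upper() == question.answer)
            total += 1
    return correct / total if total else 0.0
```

Keeping the questions grouped under their passage lets a prompt include the shared instructional context once per group rather than once per question.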

Task 3: Teaching Practice Evaluation

Objective: Measure instructional effectiveness through simulated teaching interactions
Methodology:
- Teacher models generate educational content from 120 teaching materials and guidelines
- Student models are tested before and after receiving instruction
- Effectiveness is measured by the student models' performance improvement on 120 assessment questions
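
The before-and-after comparison described above can be sketched as a difference in student accuracy. The exact improvement formula is not given in this card, so the absolute accuracy gain used here is an assumption, and the function names are hypothetical.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold answers must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def teaching_gain(pre_predictions, post_predictions, gold):
    """Assumed metric: student accuracy after instruction minus accuracy before."""
    return accuracy(post_predictions, gold) - accuracy(pre_predictions, gold)


# Toy example with four questions (the benchmark itself uses 120).
gold = ["A", "B", "C", "D"]
before = ["A", "C", "A", "A"]   # 1 of 4 correct
after = ["A", "B", "C", "A"]    # 3 of 4 correct
gain = teaching_gain(before, after, gold)
```

A positive gain suggests the teacher model's content helped the student model; a value near zero or below suggests it did not.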

Citation

If this work helps you, please cite our paper.

@inproceedings{xu2025can,
  title={Can Large Language Models Be Good Language Teachers?},
  author={Xu, LiQing and Li, Qiwei and Peng, Tianshuo and Li, Zuchao and Zhao, Hai and Wang, Ping},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={23968--23982},
  year={2025}
}

license: cc-by-4.0
