---
title: README
emoji: πŸ“š
colorFrom: pink
colorTo: red
sdk: streamlit
pinned: false
---

# Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies

Paper available from: https://arxiv.org/abs/2410.22886

Salhan et al. (2024) create age-ordered corpora of Child-Directed Speech for four typologically distant language families to implement Small-Scale Language Models (SSLMs) and acquisition-inspired curricula cross-lingually.

The MAO-CHILDES dataset contains extracted orthographic datasets for French, German, Japanese, and Chinese, as well as several other lower-resource languages. It is part of a wider effort towards cognitively-inspired pretraining using resources from Language Acquisition research.
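
As a minimal sketch, the corpora can typically be loaded with the 🤗 `datasets` library. The repository ID and configuration name below are placeholders, not the actual identifiers; replace them with the dataset ID and language configs listed on the Hub.

```python
from datasets import load_dataset

# Placeholder repository ID and language config -- substitute the actual
# MAO-CHILDES dataset ID and config names from the Hub.
mao_childes_fr = load_dataset("<org>/MAO-CHILDES", name="french", split="train")

# Each example is an orthographic Child-Directed Speech utterance,
# ordered by the age of the target child.
print(mao_childes_fr[0])
```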

You can also find pretrained BabyLMs for French, German, Japanese, and Chinese, each trained with three different cognitively-inspired curriculum learning strategies, stored in the branches of the corresponding language-specific BabyLM repository (see the sketch below).
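
Because each curriculum variant lives on its own branch, a checkpoint can be loaded by passing the branch name as the `revision` argument in `transformers`. The repository ID and branch name below are placeholders for illustration only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID and branch name -- replace with the actual
# language-specific BabyLM repository and the curriculum branch you want.
repo_id = "<org>/french-babylm"
curriculum_branch = "<curriculum-branch-name>"

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=curriculum_branch)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=curriculum_branch)
```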