Mangosteen: An Open Thai Corpus for Language Model Pretraining
Paper 2507.14664
Mangosteen, a 47-billion-token Thai corpus built with a Thai-adapted pipeline, improves language model performance on Thai benchmarks.
Note Raw data for Thai Dolma - Commoncrawl - Fineweb2
Note Fineweb2 LD
Note + Quality filtering
Note + Deduplication
Note Fineweb2 by our pipeline (LD + Quality filtering + Deduplication + Content filtering)
Note Commoncrawl LD
Note + Quality filtering
Note + Deduplication
Note Commoncrawl by our pipeline (LD + Quality filtering + Deduplication + Content filtering)
Note FastText Model
Note Non-Commoncrawl subset
Note Commoncrawl subset
Note CPT base model
Note SFT model
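
The notes above name four pipeline stages in order: LD, quality filtering, deduplication, and content filtering. The following is a minimal sketch of how such stages could be chained, not the Mangosteen implementation. It assumes LD means language detection, and every function name, threshold, and blocklist term here is a hypothetical stand-in chosen for illustration.

```python
import hashlib
import re

# Thai Unicode block (U+0E00 to U+0E7F).
THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")


def is_thai(text, min_ratio=0.5):
    """Toy language detection: share of non-space characters that are Thai.
    The 0.5 threshold is an illustrative assumption."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    thai = sum(1 for ch in chars if THAI_CHAR.match(ch))
    return thai / len(chars) >= min_ratio


def passes_quality(text, min_chars=20):
    """Toy quality filter: drop very short documents (assumed threshold)."""
    return len(text.strip()) >= min_chars


def dedup_key(text):
    """Exact-duplicate key: SHA-256 of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def passes_content_filter(text, blocklist=("casino",)):
    """Toy content filter: reject documents containing a blocklisted term.
    The blocklist is a hypothetical placeholder."""
    lowered = text.lower()
    return not any(term in lowered for term in blocklist)


def run_pipeline(docs):
    """Apply LD -> quality filtering -> deduplication -> content filtering."""
    seen = set()
    kept = []
    for doc in docs:
        if not is_thai(doc):
            continue
        if not passes_quality(doc):
            continue
        key = dedup_key(doc)
        if key in seen:
            continue
        seen.add(key)
        if not passes_content_filter(doc):
            continue
        kept.append(doc)
    return kept


if __name__ == "__main__":
    thai = "ภาษาไทยเป็นภาษาราชการของประเทศไทย"
    sample = [thai, thai, "This is an English sentence that is long enough.", "ไทย"]
    print(len(run_pipeline(sample)))
```

A production pipeline would likely replace the character-ratio check with a trained classifier (the collection lists a FastText model) and exact hashing with near-duplicate detection; both substitutions are left out to keep the sketch self-contained.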