Tokenizer Study Collection: Models comparing the effects of tokenizer properties on pre-training compression and its relationship with downstream performance. • 84 items • Updated 3 days ago • 3
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published Jun 5 • 46
Article Releasing the largest multilingual open pretraining dataset By Pclanglais and 2 others • Nov 13, 2024 • 102