Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models Paper • 2506.04689 • Published Jun 5, 2025
DataComp: In search of the next generation of multimodal datasets Paper • 2304.14108 • Published Apr 27, 2023 • 2
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP Paper • 2208.05516 • Published Aug 10, 2022
DataComp-LM: In search of the next generation of training sets for language models Paper • 2406.11794 • Published Jun 17, 2024 • 55
Better Alignment with Instruction Back-and-Forth Translation Paper • 2408.04614 • Published Aug 8, 2024 • 16
Guiding Image Captioning Models Toward More Specific Captions Paper • 2307.16686 • Published Jul 31, 2023 • 16