DataComp

non-profit

https://www.datacomp.ai/dclm/index.html#home

Recent Activity

MasterVito

authored a paper 8 days ago

Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Paper • 2508.14029 • Published 13 days ago • 116

wannaphong

authored a paper 26 days ago

Mangosteen: An Open Thai Corpus for Language Model Pretraining

Paper • 2507.14664 • Published Jul 19 • 6

yixinsong

authored a paper about 1 month ago

SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

Paper • 2507.20984 • Published Jul 28 • 56

lx865712528

authored a paper about 1 month ago

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Paper • 2507.15640 • Published Jul 21 • 4

oodgnas

authored 16 papers about 2 months ago

SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage

Paper • 2303.11114 • Published Mar 20, 2023

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Paper • 2303.11916 • Published Mar 21, 2023

Neglected Free Lunch; Learning Image Classifiers Using Annotation Byproducts

Paper • 2303.17595 • Published Mar 30, 2023 • 2

What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis

Paper • 1904.01906 • Published Apr 3, 2019

Character Region Awareness for Text Detection

Paper • 1904.01941 • Published Apr 3, 2019

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Paper • 1905.04899 • Published May 13, 2019

Learning De-biased Representations with Biased Representations

Paper • 1910.02806 • Published Oct 7, 2019

An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods

Paper • 2003.03879 • Published Mar 9, 2020

AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Paper • 2006.08217 • Published Jun 15, 2020

Rethinking Channel Dimensions for Efficient Model Design

Paper • 2007.00992 • Published Jul 2, 2020 • 1

Who Wrote this Code? Watermarking for Code Generation

Paper • 2305.15060 • Published May 24, 2023 • 1

Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Paper • 2305.15080 • Published May 24, 2023

MPCHAT: Towards Multimodal Persona-Grounded Conversation

Paper • 2305.17388 • Published May 27, 2023 • 1

VideoMix: Rethinking Data Augmentation for Video Classification

Paper • 2012.03457 • Published Dec 7, 2020

Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels

Paper • 2101.05022 • Published Jan 13, 2021

Language-only Efficient Training of Zero-shot Composed Image Retrieval

Paper • 2312.01998 • Published Dec 4, 2023