Towards Data-Efficient Pretraining for Atomic Property Prediction
Abstract
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric for molecular graphs inspired by computer vision's Fréchet Inception Distance, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the upstream dataset with the minimal CSI distance to the downstream task, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data aligns poorly with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
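The abstract does not spell out the CSI formula, but since it is described as inspired by the Fréchet Inception Distance, a minimal sketch is given below assuming CSI takes the standard FID form applied to molecular graph embeddings: fit a Gaussian to the embeddings of the upstream and downstream datasets and compute the Fréchet distance between them. The function name `frechet_distance`, the notion of an `embed_graphs` encoder, and the random stand-in data are all illustrative assumptions, not the paper's implementation.

```python
# A sketch of a CSI-style dataset-alignment score, assuming it follows the
# Fréchet Inception Distance formulation on molecular graph embeddings.
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(upstream_emb: np.ndarray, downstream_emb: np.ndarray) -> float:
    """Fréchet distance between two embedding sets of shape (n_samples, dim)."""
    mu_u, mu_d = upstream_emb.mean(axis=0), downstream_emb.mean(axis=0)
    cov_u = np.cov(upstream_emb, rowvar=False)
    cov_d = np.cov(downstream_emb, rowvar=False)

    # Matrix square root of the covariance product; drop the small imaginary
    # component that can appear due to numerical error.
    covmean = sqrtm(cov_u @ cov_d)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_u - mu_d
    return float(diff @ diff + np.trace(cov_u + cov_d - 2.0 * covmean))


# Hypothetical usage: in practice `upstream` and `downstream` would come from a
# pretrained graph encoder (e.g. embed_graphs(dataset)); random vectors stand in here.
rng = np.random.default_rng(0)
upstream = rng.normal(size=(1000, 128))   # embeddings of an upstream pretraining set
downstream = rng.normal(size=(500, 128))  # embeddings of the downstream task set
print(f"CSI-style distance: {frechet_distance(upstream, downstream):.4f}")
```

Under this reading, a smaller distance would indicate closer alignment between the upstream pretraining distribution and the downstream task, which is the selection criterion the paper advocates.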
Community
In atomic property prediction, pretraining on smaller, task-relevant datasets can surpass state-of-the-art (SOTA) performance using just 1/24th of the compute. This paper introduces the Chemical Similarity Index (CSI) to measure dataset alignment, showing that selecting data with minimal CSI distance leads to better model performance. Surprisingly, adding more unrelated data can degrade accuracy, challenging the assumption that bigger datasets always improve results. Quality over quantity is key in pretraining.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Towards Foundation Models on Graphs: An Analysis on Cross-Dataset Transfer of Pretrained GNNs (2024)
- Quantifying the Importance of Data Alignment in Downstream Model Performance (2025)
- LLM Pretraining with Continuous Concepts (2025)
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection (2025)
- Scaling Pre-training to One Hundred Billion Data for Vision Language Models (2025)
- Scalable Vision Language Model Training via High Quality Data Curation (2025)
- Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection (2025)