Tokenizer Choice For LLM Training: Negligible or Crucial? Paper • 2310.08754 • Published Oct 12, 2023
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings Paper • 2202.06671 • Published Feb 14, 2022
Specialized Document Embeddings for Aspect-based Similarity of Research Papers Paper • 2203.14541 • Published Mar 28, 2022
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning Paper • 2301.09626 • Published Jan 23, 2023
scilons/roberta-base-512-110k-steps-texts_pq_3-deduped-Eng_Latn Fill-Mask • Updated Nov 13, 2024
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus Paper • 2406.08707 • Published Jun 13, 2024
Semi-automatic staging area for high-quality structured data extraction from scientific literature Paper • 2309.10923 • Published Sep 19, 2023
Mining experimental data from Materials Science literature with Large Language Models: an evaluation study Paper • 2401.11052 • Published Jan 19, 2024
Mixture of Soft Prompts for Controllable Data Generation Paper • 2303.01580 • Published Mar 2, 2023