Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation Jun 20, 2024 • 12
Synthetic dataset generation techniques: generating custom sentence similarity data May 23, 2024 • 16
Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia? May 7, 2024 • 8
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 74
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 29
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
Magpie-Align/Magpie-Reasoning-V1-150K-CoT-Deepseek-R1-Llama-70B Viewer • Updated 3 days ago • 150k • 57 • 6
librarian-bots/dataset_cards_with_metadata Viewer • Updated about 21 hours ago • 214k • 352 • 12
Why choose between strong LLM reasoning and efficient models? Use DeepSeek to generate high-quality training data, then distil that knowledge into ModernBERT (answerdotai/ModernBERT-base) for fast, efficient classification. Blog post: https://danielvanstrien.xyz/posts/2025/deepseek/distil-deepseek-modernbert.html