Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Articles

Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas

FineWeb2-C: Help Build Better Language Models in Your Language

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Let’s make a generation of amazing image generation models

Share your open ML datasets on Hugging Face Hub!

Scaling AI-based Data Processing with Hugging Face + Dask

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Data Is Better Together: A Look Back and Forward

Synthetic dataset generation techniques: generating custom sentence similarity data

Synthetic dataset generation techniques: Self-Instruct

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Introducing BERTopic Integration with Hugging Face Hub

Jupyter X Hugging Face

Image search with 🤗 datasets

davanstrien's activity

updated a dataset 24 minutes ago

data-is-better-together/fineweb-c-progress

Viewer • Updated 24 minutes ago • 783 • 381 • 3

liked a model about 3 hours ago

Ihor/Text2Graph-R1-Qwen2.5-0.5b

Text Generation • Updated about 5 hours ago • 2

liked a model about 11 hours ago

mistralai/Mistral-Small-24B-Instruct-2501

Text Generation • Updated about 7 hours ago • 290

liked a Space about 11 hours ago

Extractous

published a Space about 11 hours ago

Extractous

updated a Space about 12 hours ago

Extractous

updated a dataset about 13 hours ago

data-is-better-together/fineweb-c

Viewer • Updated about 13 hours ago • 58.1k • 822 • 36

updated a dataset about 14 hours ago

davanstrien/magpie-preference

Viewer • Updated about 14 hours ago • 517 • 1.12k • 13

liked 2 datasets about 16 hours ago

Magpie-Align/Magpie-Reasoning-V1-150K-CoT-Deepseek-R1-Llama-70B

Viewer • Updated 3 days ago • 150k • 57 • 6

cognitivecomputations/dolphin-r1

Viewer • Updated about 7 hours ago • 814k • 20 • 68

updated a dataset about 21 hours ago

librarian-bots/dataset_cards_with_metadata

Viewer • Updated about 21 hours ago • 214k • 352 • 12

updated a dataset about 24 hours ago

librarian-bots/dataset-columns

Viewer • Updated about 24 hours ago • 1.28M • 55

liked a Space 1 day ago

Running on Zero

Caracal

A simple app for doing HTR with various models.

upvoted an article 1 day ago

Article

Open-R1: a fully open reproduction of DeepSeek-R1

3 days ago

• 460

liked a model 1 day ago

ymoslem/ModernBERT-base-long-context-qe-v1

Text Classification • Updated 1 day ago • 32 • 4

liked 2 datasets 1 day ago

ServiceNow-AI/R1-Distill-SFT

Viewer • Updated 2 days ago • 1.85M • 317 • 94

huanqia/MM-IQ

Viewer • Updated about 14 hours ago • 2.71k • 12 • 5

updated a dataset 1 day ago

LabHC/histoires_morales

Viewer • Updated 1 day ago • 12k • 55 • 3

New activity in LabHC/histoires_morales 1 day ago

add citation information

#5 opened 1 day ago by

posted an update 1 day ago

Why choose between strong LLM reasoning and efficient models?

Use DeepSeek to generate high-quality training data, then distil that knowledge into ModernBERT (answerdotai/ModernBERT-base) for fast, efficient classification.

Blog post: https://danielvanstrien.xyz/posts/2025/deepseek/distil-deepseek-modernbert.html
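
The labelling half of that idea can be sketched roughly as follows. This is a minimal illustration and not the code from the blog post: the label set, the prompt wording, and the helper names (`build_prompt`, `parse_label`, `label_with_teacher`, `fake_teacher`) are all assumptions made up for this example, and `fake_teacher` is a stand-in for a real call to a DeepSeek model. The resulting records are in the text/label shape that a sequence classifier such as answerdotai/ModernBERT-base could then be fine-tuned on; that fine-tuning step is not shown here.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical label set, chosen only for illustration.
LABELS = ["positive", "negative"]


def build_prompt(text: str, labels: List[str]) -> str:
    """Ask the teacher model to finish its answer with exactly one label."""
    options = ", ".join(labels)
    return (
        f"Classify the text into one of: {options}.\n"
        f"Text: {text}\n"
        "End your answer with the label alone on the final line."
    )


def parse_label(response: str, labels: List[str]) -> Optional[str]:
    """Return the label found on the last line of the response, or None."""
    lines = response.strip().splitlines()
    if not lines:
        return None
    answer = lines[-1].strip().lower().rstrip(".")
    return answer if answer in labels else None


def label_with_teacher(
    texts: List[str],
    teacher: Callable[[str], str],
    labels: List[str] = LABELS,
) -> List[Dict[str, object]]:
    """Label texts with the teacher, dropping responses that cannot be parsed."""
    records = []
    for text in texts:
        label = parse_label(teacher(build_prompt(text, labels)), labels)
        if label is not None:
            # Store the label as an integer id, as classifier training expects.
            records.append({"text": text, "label": labels.index(label)})
    return records


def fake_teacher(prompt: str) -> str:
    """Stand-in for a real reasoning model: some reasoning, then a final line."""
    if "great" in prompt:
        return "The wording is clearly favourable.\npositive"
    if "terrible" in prompt:
        return "The wording is clearly unfavourable.\nnegative"
    return "I cannot tell."


# Example run with the stand-in teacher; the third text is dropped as unparseable.
training_records = label_with_teacher(
    ["great library", "terrible docs", "meh"], fake_teacher
)
```

Passing the teacher in as a plain callable keeps the sketch independent of any particular API client, and discarding unparseable responses is one simple way to keep noisy labels out of the distilled training set.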