FineData

Activity Feed

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)

Recent Activity

hynky published a Space about 10 hours ago
HuggingFaceFW/FinePDFsBlog
hynky updated a Space about 10 hours ago
HuggingFaceFW/FinePDFsBlog
guipenedo updated a Space about 10 hours ago
HuggingFaceFW/FinePDFsBlog

hynky
in HuggingFaceFW/finepdfs about 1 month ago

Dataset broken by latest update?
#27 opened about 1 month ago by Rijgersberg
meg
posted an update 2 months ago
🤖 Did you know your voice might be cloned without your consent from just *one sentence* of audio?
That's not great. So with @frimelle , we brainstormed a new idea for developers who want to curb malicious use: ✨The Voice Consent Gate.✨
Details, code, here: https://huggingface.co/blog/voice-consent-gate
meg
posted an update 4 months ago
🤖 As AI-generated content is shared in movies/TV/across the web, there's one simple piece of low-hanging fruit 🍇 to help people know what's real: visible watermarks. With the Gradio team, I've made sure it's trivially easy to add this disclosure to images, video, and chatbot text. See how: https://huggingface.co/blog/watermarking-with-gradio
Thanks in particular to @abidlabs and Yuvraj Sharma for the code collab.
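As a rough illustration of the visible-watermark idea only (this is not the Gradio integration described in the blog post; the helper name and label text are made up for the example), a small Pillow sketch that stamps a disclosure label onto an image might look like this:

```python
from PIL import Image, ImageDraw


def add_visible_watermark(path_in: str, path_out: str, label: str = "AI-generated") -> None:
    """Stamp a small disclosure label onto the corner of an image (illustrative sketch)."""
    img = Image.open(path_in).convert("RGBA")
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)

    # Place the label near the bottom-left corner on a translucent backdrop
    # so it stays readable over any background.
    x, y = 10, img.height - 30
    text_w = draw.textlength(label)
    draw.rectangle([x - 4, y - 4, x + text_w + 4, y + 18], fill=(0, 0, 0, 128))
    draw.text((x, y), label, fill=(255, 255, 255, 230))

    Image.alpha_composite(img, overlay).convert("RGB").save(path_out)


# Hypothetical usage:
# add_visible_watermark("generated.png", "generated_marked.png")
```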
davanstrien
posted an update 4 months ago
eliebak
posted an update 4 months ago
Super excited to announce that our research team at Hugging Face will be doing an AMA on Reddit's r/LocalLLaMA.

Come ask the team behind SmolLM, FineWeb, and more any questions! And who knows, maybe there'll be a shiny new release to talk about?

Thursday 4th September, 8AM-11AM PST 🤗

eliebak
posted an update 5 months ago
The Motif 2.6B tech report is pretty insane, first time I've seen a model with differential attention and PolyNorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training.
> They use WSD with a "simple moving average", averaging the last 6 checkpoints every 8B tokens (a rough sketch of this follows below).
> They trained on FineMath, FineWeb2, DCLM, and TxT360.
> Lots of detail on the finetuning data they used: for instance, they used EvolKit and did some "dataset fusion" to pack more compressed knowledge into the data.
> They mention they also tried Normalized GPT, QK-Norm, and Cross Layer Attention.

Motif-Technologies/Motif-2.6B
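For intuition on the checkpoint averaging mentioned above, here is a minimal sketch of a simple moving average over PyTorch state dicts; the helper name, file paths, and the window of 6 checkpoints are illustrative assumptions, not code from the Motif report:

```python
import torch


def average_checkpoints(state_dicts):
    """Average the parameters of several checkpoints (simple moving average).

    `state_dicts` is a list of PyTorch state dicts with identical keys,
    e.g. the last 6 checkpoints saved along a training run.
    """
    avg = {}
    for key in state_dicts[0]:
        # Stack the same tensor from every checkpoint and take the element-wise mean.
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        avg[key] = stacked.mean(dim=0)
    return avg


# Illustrative usage (paths are hypothetical):
# ckpts = [torch.load(f"ckpt_{i}.pt", map_location="cpu") for i in range(6)]
# torch.save(average_checkpoints(ckpts), "ckpt_sma.pt")
```

Averaging the tensors of recent checkpoints like this is a common way to smooth out late-training noise before evaluating or releasing a model.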
meg
posted an update 5 months ago
🤖 ICYMI: Yesterday, Hugging Face and OpenAI partnered to bring open source GPT to the public. This is a Big Deal in "AI world".

0. Common ground setting: OpenAI is the ChatGPT people. An "open source" model is one whose weights are available; that means the model can be "yours".
1. You don't have to interact with the company directly, nor give them your interactions, to use the system. The company can't "surveil" you.
2. You can evaluate the unique contributions of their SOTA model much more rigorously than you can when there are collections of models+code behind a closed API. You can find out specifically what the model can and can't do.
3. And you can directly customize it for whatever you'd like. Fine-tuning, wherein you give the model data that's tailored to your use cases and train it some more on that data, is trivial* when you have the model weights.
*Provided you have the compute.
4. You can directly benchmark whatever you'd like. Biases? Energy usage? Strengths/weaknesses? Go for it. You wants it you gots it--this transparency helps people understand SOTA *in general*, not just for this model, but points to, e.g., what's going on with closed Google models as well.
5. One of the most powerful things about "openness" that I've learned is that it cultivates ecosystems of collaborators building on top of one another's brilliance to make systems that are significantly better than they would be if created in isolation.
But, caveat wrt my own philosophy...
6. I do not take it as a given that advancing LLMs is good, and have a lot more to say wrt where I think innovation should focus more. For example, a focus on *data* -- curation, measurement, consent, credit, compensation, safety -- would deeply improve technology for everyone.
7. The transparency this release provides is massive for people who want to *learn* about LLMs. For the next generation of technologists to advance over the current, they MUST be able to learn about what's happening now. (cont...)
meg
posted an update 5 months ago
🤖 👾 Thanks so much to BBC News and the stellar Suranjana Tewari for having me on to talk about the US <-> China relationship in AI, and what it means for AI ethics.