AI & ML interests

ETL for LLMs

Recent Activity

jsulz
posted an update 5 months ago
We've crossed 1 million repositories backed by Xet storage on Hugging Face! 🚀🚀🚀

You can follow along with our progress converting the Hub from Git LFS to Xet at jsulz/ready-xet-go

We have a lot of repos left to migrate, which means I have plenty of time to add more animations 🤪
jsulz
posted an update 5 months ago
We've moved over 20PB from Git LFS to Xet on the Hub without downtime or data loss. Having things "just work" on a migration of this scale is about as good as it gets.

Now, we're migrating the rest of the Hub: https://huggingface.co/blog/migrating-the-hub-to-xet

But how did we get here?

In the early days of joining Hugging Face, we made a few key design decisions:
* There would be no "hard cut-over" from Git LFS to Xet
* A Xet-enabled repository should be able to contain both Xet and LFS files
* Repository migrations from LFS to Xet would run in the background without disrupting downloads or uploads

These were largely driven by our desire to ensure the community could keep working without interruption.

We cover the infrastructure making this all go in this post, specifically:
* An integral piece of infrastructure known internally as the Git LFS Bridge
* Background content migrations that run around the clock

To skip the wait and join Xet now, sign up here https://huggingface.co/join/xet
jsulz
posted an update 6 months ago
It's been a bit since I took a step back and looked at xet-team's progress migrating Hugging Face from Git LFS to Xet, but every time I do, it boggles the mind.

A month ago there were 5,500 users/orgs on Xet with 150K repos and 4PB. Today?
🤗 700,000 users/orgs
📈 350,000 repos
🚀 15PB

Meanwhile, our migrations have pushed throughput to numbers that are bonkers. In June, we hit upload speeds of 577 Gb/s (crossing 500 Gb/s for the first time).

These are hard numbers to put into context, but let's try:

The latest run of the Common Crawl from commoncrawl was 471 TB.

We now have ~32 crawls stored in Xet. At peak upload speed we could move the latest crawl into Xet in about two hours.
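As a quick back-of-the-envelope check on that two-hour figure (using decimal units, so 1 TB = 8,000 gigabits):

```python
# Rough transfer-time estimate: the latest Common Crawl at peak Xet upload speed.
crawl_tb = 471        # latest crawl size, in terabytes
peak_gbps = 577       # peak upload throughput, in gigabits per second

crawl_gbit = crawl_tb * 1000 * 8        # TB -> gigabits (decimal units)
hours = crawl_gbit / peak_gbps / 3600   # seconds -> hours
print(f"~{hours:.1f} hours")
```

Which lands right around the two-hour mark.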

We're moving to a new phase in the process, so stay tuned.

This shift in gears means it's also time to roll up our sleeves and look at all the bytes we have and the value we're adding to the community.

I already have some homework from @RichardErkhov to look at the dedupe across their uploads, and I'll be doing the same for other early adopters, big models/datasets, and frequent uploaders (looking at you @bartowski 👀)

Let me know if there's anything you're interested in; happy to dig in!
jsulz
posted an update 7 months ago
With major model families like Qwen and all of Llama from meta-llama on Xet, the time is right for new users and organizations to say goodbye to LFS on the Hub.

Xet is now the default storage for new AI builders 🚀 🚀 🚀

Just sign up for an account, create a new model or dataset, pip install huggingface_hub, and you're off to the races!
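In full, the first-run flow looks roughly like this (a sketch; the repo name is illustrative, and the upload command needs a recent huggingface_hub):

```shell
pip install -U huggingface_hub   # new accounts get Xet-backed storage by default
huggingface-cli login            # paste a token from your account settings
huggingface-cli upload your-username/my-model ./model.safetensors
```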

Read more here https://huggingface.co/changelog/xet-default-for-new-users

And for everyone with existing repositories, just sign up here https://huggingface.co/join/xet - we'll migrate all existing repositories to Xet and all new repos you create will be Xet-backed by default.
jsulz
posted an update 7 months ago
Heyo @RichardErkhov, the xet-team at Hugging Face was wondering if you wanted to join the fun and jump over to Xet storage. 🤗

We've been onboarding folks ( https://huggingface.co/blog/xet-on-the-hub ), we know the backend can scale (Llama 4 and Qwen 3 are on Xet) and that it's great for working with quants (see xet-team/quantization-dedup ), and we're pushing on inviting impactful orgs and users on the Hub. You fit the bill.

We'd love to onboard you, get some feedback, and create some excitement 🎉

The steps are pretty straightforward - join the waitlist at hf.co/join/xet and we'll take care of the rest.

The system is fully backward compatible, so you shouldn't notice a thing. BUT to get the best experience when uploading/downloading, make sure you have hf_xet installed alongside the latest huggingface_hub.
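That one-liner, for reference:

```shell
# Upgrade both together; huggingface_hub uses the Xet backend when hf_xet is present.
pip install -U huggingface_hub hf_xet
```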

What do you think?
jsulz
posted an update 8 months ago
At xet-team we've been hard at work bringing a new generation of storage to the Hugging Face community, and we've crossed some major milestones:

👷 Over 2,000 builders and nearing 100 organizations with access to Xet
🚀 Over 70,000 model and dataset repositories are Xet-backed
🤯 1.4 petabytes managed by Xet

As we move repos from LFS to Xet for everyone we onboard, we're pushing our content-addressed store (CAS). Check out the chart below 👇 of CAS hitting up to 150 Gb/s throughput this past week.

All of this growth is helping us build richer insights. We expanded our repo graph, which maps how Xet-backed repositories on the Hub share bytes with each other.

Check out the current network in the image below (nodes are repositories, edges are where repos share bytes) and visit the space to see how different versions of Qwen, Llama, and Phi models are grouped together: xet-team/repo-graph

Join the waitlist to get access! https://huggingface.co/join/xet
jsulz
posted an update 9 months ago
As xet-team infrastructure begins backing hundreds of repositories on the Hugging Face Hub, we're getting to put on our researcher hats and peer into the bytes. 👀 🤓

IMO, one of the most interesting ideas Xet storage introduces is a globally shared store of data.

When you upload a file through Xet, the contents are split into ~64KB chunks and deduplicated, but what if those same chunks already exist in another repo on the Hub?

If we can detect and reuse them, we skip uploading them, saving time and bandwidth for AI builders. More on how that works here:
🔗 https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation
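The core mechanic can be sketched in a few lines. This toy version uses fixed-size 64 KB chunks and SHA-256 where the real system uses content-defined chunking, so treat it as an illustration only:

```python
import hashlib
import random

CHUNK_SIZE = 64 * 1024  # ~64 KB, matching Xet's target chunk size

def chunk_hashes(data: bytes) -> list[str]:
    # Fixed-size chunking for simplicity; Xet uses content-defined boundaries.
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]

def upload(data: bytes, store: set[str]) -> int:
    """Add a file to the global store; return how many chunks were actually new."""
    new = 0
    for h in chunk_hashes(data):
        if h not in store:
            store.add(h)
            new += 1
    return new

store: set[str] = set()
random.seed(0)
original = random.randbytes(200_000)    # ~200 KB "model file"
revised = original + b"one small edit"  # a near-identical revision

new_first = upload(original, store)
new_second = upload(revised, store)
print(new_first, new_second)  # 4 1 -> only the final chunk of the revision is new
```

The second upload only moves the chunk that actually changed; everything else is already in the store.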

Because of this, different repositories can share bytes we store. That opens up something cool - we can draw a graph of which repos actually share data at the chunk level, where:

- Nodes = repositories
- Edges = shared chunks
- Edge thickness = how much they overlap
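In miniature, with each repo reduced to the set of its chunk hashes (made-up data here), the edges fall out of pairwise set intersections:

```python
from itertools import combinations

# Hypothetical repos and the chunk hashes they contain
repos = {
    "bert-base": {"c1", "c2", "c3", "c4"},
    "bert-finetune": {"c1", "c2", "c3", "c9"},
    "llama-quant": {"c7", "c8"},
}

# An edge exists where two repos share chunks; its weight is the overlap size
edges = {
    (a, b): len(repos[a] & repos[b])
    for a, b in combinations(repos, 2)
    if repos[a] & repos[b]
}
print(edges)  # {('bert-base', 'bert-finetune'): 3}
```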

xet-team/repo-graph

Come find the many BERT islands. Or see how datasets relate in practice, not just in theory. See how libraries or tasks can tie repositories together. You can play around with node size using storage/likes/downloads too.

The result is a super fun visualization from @saba9 and @znation that I've already lost way too much time to. I'm excited to see how the networks grow as we add more repositories!
jsulz
posted an update 9 months ago
What does it mean when models share the same bytes?

We've investigated some quants and seen that quantizations of the same model share a considerable portion of their bytes, which can be deduplicated to save quantizers on the Hub significant upload time.

This space, where we crack open a repo from @bartowski, shows we can get significant dedupe: xet-team/quantization-dedup

You can get a sense of why by reading this write-up: https://github.com/bartowski1182/llm-knowledge/blob/main/quantization/quantization.md

But what about finetuned models?

Since going into production, the xet-team has migrated hundreds of repositories on the Hub to our storage layer, including classic "pre-Hub" open-source models like FacebookAI/xlm-roberta-large (XLM-R) from FacebookAI.

XLM-R, introduced in 2019, set new benchmarks for multilingual NLP by learning shared representations across 100 languages. It was then fine-tuned on English, Spanish, Dutch, and German, generating language-specific derivations for each. Check out the paper here: Unsupervised Cross-lingual Representation Learning at Scale (1911.02116)

These finetunes share much of the same architecture and layout as XLM-R with similar training methods and goals. It makes sense that they would share bytes, but it's still fascinating to see.

We put together a similar space to explore these models to see where they overlap - check it out for yourself xet-team/finetune-dedupe

The darker each block in the heatmap, the more bytes are shared. Clicking on a repo's blocks shows all other repos that share blocks.
jsulz
posted an update 9 months ago
The Llama 4 release - meta-llama/llama-4-67f0c30d9fe03840bc9d0164 - was a big one for the xet-team, with every model backed by the storage infrastructure of the future for the Hub.

It's been a wild few days, and especially 🤯 to see every tensor file with a Xet logo next to it instead of LFS.

The attached graph shows requests per second to our content-addressed store (CAS) right as the release went live.

yellow = GETs; dashed line = launch time.

You can definitely tell when the community started downloading 👀

h/t to @rajatarya for the graph, the entire Xet crew for bringing us to this point, and special shoutout to Rajat, @port8080 , @brianronan , @seanses , and @znation who made sure the bytes kept flying all weekend ⚡️
jsulz
posted an update 9 months ago
Huge week for xet-team as Llama 4 is the first major model on Hugging Face uploaded with Xet providing the backing! Every byte downloaded comes through our infrastructure.

Using Xet on Hugging Face is the fastest way to download and iterate on open source models, and we've proved it with Llama 4, which saw a boost of ~25% across all models.

We expect builders on the Hub to see even more improvements, helping power innovation across the community.

With the models on our infrastructure, we can peer in and see how well our dedupe performs across the Llama 4 family. On average, we're seeing ~25% dedupe, providing huge savings to the community who iterate on these state-of-the-art models. The attached image shows a few selected models and how they perform on Xet.

Thanks to the meta-llama team for launching on Xet!
jsulz
posted an update 9 months ago
If you've been following along with the Xet Team's ( xet-team ) work, you know we've been working to migrate the Hugging Face Hub from Git LFS to Xet.

Recently, we launched a waitlist to join the movement to Xet (join here! https://huggingface.co/join/xet ), but getting to this point was a journey.

We went from the initial proof of concept in August, to launching on the Hub internally, to migrating a set of repositories and routing a small chunk of download traffic on the Hub through our infrastructure. Every step of the way has been full of challenges, big and small, and well worth the effort.

Over the past few weeks, with real traffic flowing through our services, we've tackled some truly gnarly issues (unusual upload/download patterns, memory leaks, load imbalances, and more) and resolved each without major disruptions.

If you're curious about how this sliver of Hub infrastructure looked as we routed traffic through it for the first time (and want a deep dive full of Grafana and Kibana charts 🤓), I have a post for you.

Here's an inside look into the day of our first migrations and the weeks following, where we pieced together solutions in real time.

https://huggingface.co/blog/xet-on-the-hub
jsulz
posted an update 10 months ago
It's finally here ❤️

Build faster than ever with lightning-fast upload and download speeds, starting today on the Hub ⚡

Xet storage is rolling out access across the Hub - join the waitlist here https://huggingface.co/join/xet

You can apply for yourself or your entire organization. Head over to your account settings for more information, or join from anywhere you see the Xet logo on a repository.

Have questions? Join the conversation below 👇 or open a discussion on the Xet team page xet-team/README