Sam Horradarn

sirahd

AI & ML interests

None yet

Recent Activity

upvoted an article about 5 hours ago
Xet is on the Hub
published an article about 22 hours ago
Xet is on the Hub
updated a model 27 days ago
sirahd/test-xet-migration-2

Organizations

Hugging Face, Xet Team

sirahd's activity

upvoted an article about 5 hours ago

Xet is on the Hub

By jsulz and 5 others
16
published an article about 22 hours ago

Xet is on the Hub

By jsulz and 5 others
16

How can we find the chunk content using chunk hash?

Chunks are produced via content-defined chunking (CDC), and each chunk's hash is computed from its content, which means that two chunks with the same content always share the same hash. This removes the need to store a separate chunk hash -> chunk content mapping: if two chunks share the same hash, we know they have identical content.
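
To make the idea concrete, here is a minimal toy sketch of content-defined chunking plus content hashing. It is not our production chunker: the gear-style rolling hash, the `GEAR` table, `BOUNDARY_BITS`, and the use of SHA-256 are all assumptions picked for illustration, and a real chunker would also enforce minimum and maximum chunk sizes.

```python
import hashlib
import random

# Hypothetical toy parameters, chosen only for this illustration.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big")
        for b in range(256)]
MASK64 = (1 << 64) - 1
BOUNDARY_BITS = 6  # a cut roughly every 2**6 = 64 bytes on average


def cdc_chunks(data: bytes) -> list:
    """Split data where a rolling hash of the most recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear-style rolling hash: bytes older than 64 positions shift out.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if h >> (64 - BOUNDARY_BITS) == 0:  # top bits all zero -> boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def chunk_hash(chunk: bytes) -> str:
    """The hash depends only on the chunk's content (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"inserted header " + original  # shift every byte of the original

original_hashes = [chunk_hash(c) for c in cdc_chunks(original)]
edited_hashes = [chunk_hash(c) for c in cdc_chunks(edited)]
shared = set(original_hashes) & set(edited_hashes)
print(f"{len(shared)} of {len(original_hashes)} chunk hashes reused after the edit")
```

Because boundaries depend only on nearby bytes, inserting data at the front changes the first chunk or two and leaves the later chunks, and therefore their hashes, untouched.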

The CAS system only stores "block_hash -> block_content". Where is the chunk-to-block mapping stored?

This is explained in the "key chunks" section of the blog post above. Essentially, we store only a tiny subset of the chunk -> block mappings by leveraging spatial locality in the file. Trying to store every chunk -> block mapping gets impractical very quickly.
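
As a rough illustration of that idea (not our actual implementation), the sketch below indexes only chunks whose hash passes a sampling rule, then uses a single hit to pull in the whole block's chunk list. The `SparseIndex` class, the `KEY_MODULUS` rule, and the block IDs are hypothetical names made up for this example.

```python
import hashlib

KEY_MODULUS = 4  # hypothetical sampling rule; a real system would be far sparser


def chunk_hash(chunk: bytes) -> int:
    return int.from_bytes(hashlib.sha256(chunk).digest()[:8], "big")


def is_key_chunk(h: int) -> bool:
    return h % KEY_MODULUS == 0


class SparseIndex:
    def __init__(self):
        self.blocks = {}        # block_id -> chunk hashes stored in that block
        self.key_to_block = {}  # key chunk hash -> block_id (the sparse part)

    def add_block(self, block_id, chunks):
        hashes = [chunk_hash(c) for c in chunks]
        self.blocks[block_id] = hashes
        for h in hashes:
            if is_key_chunk(h):  # only key chunks are indexed
                self.key_to_block.setdefault(h, block_id)

    def find_known_chunks(self, chunks):
        """Return hashes of incoming chunks that already live in an indexed block."""
        incoming = [chunk_hash(c) for c in chunks]
        # A key-chunk hit points at a block; neighbouring chunks tend to be
        # in the same block, so its full chunk list is checked as well.
        candidates = {self.key_to_block[h] for h in incoming
                      if h in self.key_to_block}
        known = set()
        for block_id in candidates:
            known.update(self.blocks[block_id])
        return known & set(incoming)


stored_chunks = [f"chunk-{i}".encode() for i in range(64)]
index = SparseIndex()
index.add_block("block-0", stored_chunks)

incoming = stored_chunks[10:30] + [b"brand new chunk"]
known = index.find_known_chunks(incoming)
print(f"index entries: {len(index.key_to_block)}, known chunks: {len(known)}")
```

The trade-off is that a run of chunks containing no key chunk goes undetected, which is the price of keeping the index small.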

What do the shards store? Is it "file_name, shard_id, chunk_hash, block_hash"?

You can think of the shards as storing mappings from a file (identified by its file hash) to the list of chunks that make up the file.
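
A minimal way to picture this, assuming a plain in-memory dictionary rather than our real shard format: `shard`, `chunk_store`, `add_file`, and the file-hash rule below are all hypothetical, and chunks are stored individually here for brevity even though the CAS stores blocks.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


chunk_store = {}  # chunk hash -> chunk bytes (stand-in for content-addressed storage)
shard = {}        # file hash -> ordered list of chunk hashes


def add_file(chunks):
    """Record a file as an ordered list of chunk hashes; store unseen chunks once."""
    hashes = []
    for chunk in chunks:
        h = sha(chunk)
        chunk_store.setdefault(h, chunk)  # identical content is stored only once
        hashes.append(h)
    file_hash = sha("".join(hashes).encode())  # hypothetical file-hash rule
    shard[file_hash] = hashes
    return file_hash


def reconstruct(file_hash):
    """Rebuild the file by following its chunk list in order."""
    return b"".join(chunk_store[h] for h in shard[file_hash])


v1 = add_file([b"header|", b"weights-a|", b"weights-b"])
v2 = add_file([b"header|", b"weights-a|", b"weights-c"])

print(reconstruct(v1))
print(f"{len(chunk_store)} unique chunks stored for 2 files")
```

Two versions of a file that differ in one chunk share every other entry in their chunk lists, so only the changed chunk adds new storage.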

I hope this helps explain our underlying tech better!

upvoted an article 28 days ago

From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub

By jsulz and 3 others
50