Prasanna Iyer
prasiyer
2 followers · 2 following
AI & ML interests
None yet
Recent Activity
reacted to fdaudens's post with 🔥 · 25 days ago
Is this the best tool yet to extract clean info from PDFs, handwriting, and complex documents? Open-source olmOCR just dropped, and the results are impressive. Tested the free demo with various documents, including a handwritten Claes Oldenburg letter.

The speed is impressive: 3,000 tokens/second on your own GPU, at 1/32 the cost of GPT-4o ($190 per million pages). A game-changer for content extraction and digital archives.

To achieve this, Ai2 trained a 7B vision-language model on 260K pages from 100K PDFs using "document anchoring": combining PDF metadata with page images.

Best part: it actually understands document structure (columns, tables, equations) instead of just jumbling everything together like most OCR tools. Their human-eval results back this up.

Try the demo: https://olmocr.allenai.org
Going right into the AI toolkit: https://huggingface.co/spaces/JournalistsonHF/ai-toolkit
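The post's pricing claim can be sanity-checked with simple arithmetic. The sketch below takes the two figures stated in the post ($190 per million pages, and "1/32 the cost of GPT-4o") as given and derives the implied GPT-4o cost; the variable names are illustrative, not from any olmOCR documentation:

```python
# Sanity-check the cost claim: olmOCR at $190 per million pages
# is stated to be 1/32 the cost of GPT-4o for the same workload.
olmocr_cost_per_m_pages = 190   # USD per million pages, from the post
cost_ratio = 32                 # "1/32 the cost", from the post

# Implied GPT-4o cost for a million pages under those assumptions
implied_gpt4o_cost = olmocr_cost_per_m_pages * cost_ratio
print(implied_gpt4o_cost)  # 6080
```

So the claim implies roughly $6,080 per million pages for GPT-4o, consistent with a ~32x saving.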
reacted to nroggendorff's post · 2 months ago
Maybe a page where you can find open orgs to get started collaborating with HF. I see so many people that don't have a direction. I don't have ulterior motives, so don't ask.
replied to merve's post · 8 months ago
Chameleon 🦎 by Meta is now available in Hugging Face transformers: a vision-language model that comes in 7B and 34B sizes 🤩 But what makes this model so special?

Demo: https://huggingface.co/spaces/merve/chameleon-7b
Models: https://huggingface.co/collections/facebook/chameleon-668da9663f80d483b4c61f58

Keep reading ⥥

Chameleon is a unique model: it attempts to scale early fusion 🤨 But what is early fusion? Modern vision-language models use a vision encoder with a projection layer to project image embeddings so they can be consumed by the text decoder (LLM). Early fusion, on the other hand, attempts to fuse all features together (image patches and text) by using an image tokenizer; all tokens are projected into a shared space, which enables seamless generation.

The authors also introduced architectural improvements (QK-norm and revised placement of layer norms) for scalable and stable training, and they were able to increase the token count (5x the tokens compared to Llama 3, which is a must with early fusion IMO).

This model is an any-to-any model thanks to early fusion: it can take image and text input and output image and text, but image generation is disabled to prevent malicious use. One can also do text-only prompting; the authors noted the model catches up with larger LLMs (like Mixtral 8x7B or the larger Llama-2 70B), and also image-pair prompting with larger VLMs like IDEFICS2-80B (see the paper for benchmarks: https://huggingface.co/papers/2405.09818).

Thanks for reading!
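The early-fusion idea described in the post can be sketched with toy tensors: an image tokenizer maps patches to discrete codes, those codes are offset into the same vocabulary as text tokens, and one shared embedding table handles the fused sequence. This is a minimal NumPy illustration of the concept only; all sizes, names, and the nearest-neighbor "tokenizer" are made up for the example and are not Chameleon's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared vocabulary: 1000 text tokens + 256 image codebook
# entries, all embedded by ONE table (the "shared space").
TEXT_VOCAB, IMAGE_VOCAB, DIM = 1000, 256, 32
embed = rng.normal(size=(TEXT_VOCAB + IMAGE_VOCAB, DIM))

def tokenize_image(patches, codebook):
    # VQ-style image tokenizer: snap each patch to its nearest
    # codebook entry, then offset the index into the shared vocab.
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1) + TEXT_VOCAB

codebook = rng.normal(size=(IMAGE_VOCAB, DIM))
patches = rng.normal(size=(16, DIM))          # 16 toy image patches
text_ids = rng.integers(0, TEXT_VOCAB, 8)     # 8 toy text tokens

# Early fusion: image and text tokens form one sequence up front,
# rather than projecting encoder features into the LLM later.
image_ids = tokenize_image(patches, codebook)
sequence = np.concatenate([image_ids, text_ids])
hidden = embed[sequence]                      # one embedding space

print(sequence.shape, hidden.shape)  # (24,) (24, 32)
```

Because image and text share one token space, a decoder over `hidden` could in principle emit either modality, which is what makes the any-to-any behavior described above natural under early fusion.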
Organizations
None yet
models
None public yet
datasets
None public yet