Hugging Face implementation
Hello @Weiyun1025 and all authors, congratulations on the release!
I'm Pablo from Hugging Face. I see you already have a script to convert to the HF format, and a link to HF weights at https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B-HF, which is currently not working, maybe it's still private?
We'd love to see this model implemented in transformers soon! Please let me know how far along you are in the conversion; we'd be super happy to jump in and help.
Thank you for your interest in our work! We will update the HF-format checkpoint within this week, as we are currently conducting accuracy verification.
That's great! I've put together a conversion script on my end to take a look and make sure it's portable. I managed to run the 1B and convert the 20B. I've pushed a short draft here: https://github.com/huggingface/transformers/pull/40506
For the image processing that you include in your example (`load_image` and the function to find the closest aspect ratio), this functionality is very close to what already exists in, for instance, `Phi4MultimodalImageProcessorFast`, so you could likely use that image processor directly.
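For reference, the "closest aspect ratio" step amounts to picking the tile grid whose shape best matches the input image. Here's a minimal sketch of that idea; the function name, tile size, and tie-breaking rule are my own assumptions, not the exact InternVL or Phi4 implementation:

```python
def closest_aspect_ratio(width, height, min_tiles=1, max_tiles=12, tile=448):
    """Pick the (cols, rows) tile grid whose aspect ratio best matches the image.

    Hypothetical helper, sketching the dynamic-tiling preprocessing idea:
    enumerate all grids with min_tiles <= cols * rows <= max_tiles and
    choose the one closest to the image's aspect ratio.
    """
    target = width / height
    candidates = {
        (c, r)
        for n in range(min_tiles, max_tiles + 1)
        for c in range(1, n + 1)
        for r in range(1, n + 1)
        if min_tiles <= c * r <= max_tiles
    }
    best, best_diff = (1, 1), float("inf")
    for cols, rows in sorted(candidates):
        diff = abs(cols / rows - target)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and cols * rows > best[0] * best[1]:
            # On ties, prefer more tiles only when the image is large
            # enough to fill them (assumed heuristic).
            if width * height > 0.5 * tile * tile * cols * rows:
                best = (cols, rows)
    return best
```

E.g. a 896x448 image would map to a 2x1 grid of 448px tiles.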
When you're done, feel free to ping me (molbap on github) for an early review!
Hey @Weiyun1025, just to let you know, I've updated the PR above and the conversion script to make sure it works with the current models. I tried all of them (the intermediate collection, not the core one) up to the 241B, and they gave correct generations, very cool model.
However, in the intermediate collection I can't find the binary classifier that routes between 64 and 256 image tokens. Is it only in a few models, or will it be in the core collection?
Thank you for your contribution to InternVL! Over the past two days, we have completed the ckpt conversion and uploaded the models to HuggingFace. Our evaluation results show that both types of ckpt achieve consistent performance. I have been busy organizing various open-source code and documentation these past two days, so I didn't notice that you had already implemented a version of the conversion script. We will highlight your script in the InternVL GitHub after it is merged into Transformers. Once again, thank you for your contribution to InternVL!
As for the Visual Resolution Router, we will open-source it on HuggingFace as soon as possible. We are currently organizing the model architecture code and training code. Each scale of the model will have a corresponding Flash version (i.e., equipped with a binary classifier to route image patches). Thank you for your interest in our work!
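Since the ViR code isn't released yet, here is only a rough sketch of what "a binary classifier routing image patches" between 64 and 256 tokens could look like, using made-up shapes and a plain NumPy sigmoid head; this is an illustration of the routing idea, not the actual InternVL3.5-Flash implementation:

```python
import numpy as np

def route_tile(tile_feats, w, b, threshold=0.5):
    """Route one tile's patch features to a 256- or 64-token representation.

    Hypothetical sketch: a learned binary score on the mean-pooled tile
    decides whether to keep the full 16x16 = 256 token grid or compress
    it with 2x2 average pooling down to 8x8 = 64 tokens.
    tile_feats: array of shape (256, d).
    """
    pooled = tile_feats.mean(axis=0)                  # (d,)
    score = 1.0 / (1.0 + np.exp(-(pooled @ w + b)))   # sigmoid of a linear head
    if score >= threshold:
        return tile_feats                             # keep all 256 tokens
    # Compress: 2x2 average pooling over the 16x16 token grid -> 64 tokens.
    d = tile_feats.shape[1]
    grid = tile_feats.reshape(16, 16, d)
    pooled_grid = grid.reshape(8, 2, 8, 2, d).mean(axis=(1, 3))
    return pooled_grid.reshape(64, d)
```

The interesting design question is where the classifier sits (per tile vs. per image) and what supervises it; the released code will settle that.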
Amazing! Yes, I know that releases are always a very busy time, best of luck to you. Looking forward to the ViR as well!