The first vision language model built on openai/gpt-oss-20b just dropped! 🔥
InternVL3.5 comes with 32 models 🤯 pre-trained, fine-tuned, and aligned, in various sizes: OpenGVLab/internvl35-68ac87bd52ebe953485927fb. The LLM part is either gpt-oss or Qwen3.
Fine-tune Gemma3n on videos with audio on a Colab A100 🔥 Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!
Keep in mind, it's made for educational purposes 🫡 We use LoRA, audio resampling & video downsampling to be able to train in under 40GB of VRAM. Stretch the modalities and unfreeze layers as you wish! 👉🏻 merve/smol-vision
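To give a feel for the two preprocessing tricks mentioned above, here is a minimal sketch in plain Python. It is not the notebook's code: `resample_audio` and `downsample_frames` are hypothetical helpers, the linear interpolation is a crude stand-in for a proper anti-aliased resampler, and the 16 kHz target and frame count in the usage lines are assumptions for illustration only.

```python
def resample_audio(wave, orig_sr, target_sr):
    """Resample a mono waveform by linear interpolation.

    Illustrative only: no anti-aliasing filter is applied, so a real
    pipeline would normally use a dedicated audio library instead.
    """
    n_out = int(round(len(wave) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr          # fractional index into `wave`
        lo = min(int(pos), len(wave) - 1)      # sample at or before `pos`
        hi = min(lo + 1, len(wave) - 1)        # next sample, clamped at the end
        frac = pos - lo
        out.append(wave[lo] * (1 - frac) + wave[hi] * frac)
    return out


def downsample_frames(frames, num_frames):
    """Keep `num_frames` evenly spaced frames, always including first and last."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if len(frames) <= num_frames:
        return list(frames)
    if num_frames == 1:
        return [frames[0]]
    step = (len(frames) - 1) / (num_frames - 1)
    return [frames[round(i * step)] for i in range(num_frames)]


# Usage: one second of 44.1 kHz audio down to an assumed 16 kHz target,
# and a 120-frame clip reduced to 8 frames.
audio_16k = resample_audio([0.0] * 44100, orig_sr=44100, target_sr=16000)
kept = downsample_frames(list(range(120)), num_frames=8)
```

Fewer audio samples and fewer frames mean shorter input sequences, which is where most of the memory savings come from alongside LoRA.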
They have an image tokenizer unified with text, and they de-tokenize using either of two models (LLM and diffusion). The model is actually a full LLM (Qwen2); the tokenizer converts image tokens 🤯
YAML engineering is becoming more important than ever, from infra provisioning to model training (recipes).
Here, I built a simple editor first for @dstackai, and I will share the live endpoint this week. Let me know what you think about this approach.
Based on this approach, if people think this is useful, I am going to do the same thing for the LLM training recipes for popular frameworks such as Hugging Face open-r1, Axolotl, and so on. Let me hear.
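For a rough idea of what an editor like this might check behind the scenes, here is a small sketch of validating a training recipe after the YAML has been parsed into a dict. Everything in it is an assumption for illustration: `RECIPE_SCHEMA`, its field names, and `validate_recipe` are hypothetical and do not reflect the actual schema of dstack, open-r1, or Axolotl.

```python
# Hypothetical schema: field name -> (expected type, required?)
RECIPE_SCHEMA = {
    "model": (str, True),
    "learning_rate": (float, True),
    "epochs": (int, True),
    "lora_rank": (int, False),
}


def validate_recipe(recipe, schema=RECIPE_SCHEMA):
    """Return a list of human-readable problems found in a parsed recipe."""
    errors = []
    for key, (expected, required) in schema.items():
        if key not in recipe:
            if required:
                errors.append(f"missing required field: {key}")
            continue
        value = recipe[key]
        if not isinstance(value, expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Flag typos and unsupported options instead of silently ignoring them.
    for key in recipe:
        if key not in schema:
            errors.append(f"unknown field: {key}")
    return errors


# Usage: a recipe with a missing field, a wrong type, and an unknown key.
problems = validate_recipe({"model": "my-model", "epochs": "3", "foo": 1})
```

An editor can surface a list like `problems` inline as the user types, which is the main advantage over discovering the mistake after a job has been provisioned.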