---
tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
language:
  - en
base_model:
  - google/vit-base-patch16-224
  - google-bert/bert-base-uncased
  - MCG-NJU/videomae-base
---

πŸ–ΌοΈπŸ“ OneEncoder: A Unified Text & Image & Video Model

OneEncoder is a lightweight framework for cross-modal alignment that efficiently integrates text, image, and video (with future extensions to other modalities). Unlike traditional methods that rely on massive modality-specific encoders, OneEncoder progressively aligns different data types, making it cost-effective and performant even on small paired datasets.
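Since the repository is tagged with `pytorch_model_hub_mixin`, the checkpoint can be loaded through `huggingface_hub`'s `PyTorchModelHubMixin`. The sketch below only illustrates that loading pattern; the `OneEncoder` stub and the repository id in the final comment are assumptions, not the actual class or id from the GitHub repo.

```python
# Minimal loading sketch via PyTorchModelHubMixin (class body and repo id are
# placeholders; use the real OneEncoder class from the GitHub repo).
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class OneEncoder(nn.Module, PyTorchModelHubMixin):
    """Stub standing in for the real architecture defined in the GitHub repo."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.universal_projection = nn.Linear(dim, dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.universal_projection(features)

model = OneEncoder(dim=768)  # local instantiation for illustration
# With the real class, weights come straight from the Hub (repo id is an assumption):
# model = OneEncoder.from_pretrained("bilalfaye/OneEncoder")
```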

## πŸš€ Key Features

- βœ… Multimodal Alignment: Currently supports text, image, and video, with extensions to other modalities.
- βœ… Lightweight & Efficient: Avoids full retraining when adding new modalities.
- βœ… Superior Performance: Outperforms models that require large specialized datasets.

## 🎯 Applications

- Visual Question Answering (VQA)
- Image-Text-Video Retrieval
- Multimodal Content Understanding

## πŸ“„ Research Paper

πŸ“œ arXiv: OneEncoder: Progressive Cross-Modal Alignment

## πŸ“Œ Resources

- πŸ”— GitHub Repo: OneEncoder
- πŸš€ Hugging Face Demo: OneEncoder Retriever
- πŸ““ Demo Notebook: OneEncoder Demos
- πŸ”Š OneEncoder for Text, Image: HF Model
- πŸ”Š OneEncoder for Text, Image & Audio: HF Model
- πŸ”Š OneEncoder for Text, Image & X-ray: HF Model

πŸ“ Authors

πŸ“Œ Bilal FAYE, Hanane AZZAG, Mustapha LEBBAH, Djamel BOUCHAFFRA

Note: This model was trained with temperature=2.5 and addition as the fusion operation.
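For concreteness, here is a hedged sketch of what that note implies: element-wise addition to fuse modality features, and a CLIP-style contrastive objective whose similarity logits are scaled by temperature = 2.5. Function names, tensor shapes, and the exact placement of the temperature are assumptions, not the repository's actual code.

```python
# Illustrative sketch of the training note (temperature=2.5, addition as fusion).
# Names and shapes are hypothetical; features are assumed to be (batch, dim).
import torch
import torch.nn.functional as F

def fuse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # "Addition as fusion operation": element-wise sum of two aligned feature sets.
    return a + b

def contrastive_loss(text_feats: torch.Tensor,
                     visual_feats: torch.Tensor,
                     temperature: float = 2.5) -> torch.Tensor:
    text_feats = F.normalize(text_feats, dim=-1)
    visual_feats = F.normalize(visual_feats, dim=-1)
    # Temperature-scaled cosine similarities between all text/visual pairs.
    logits = text_feats @ visual_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over matched (diagonal) pairs, CLIP-style.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```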