πŸ–ΌοΈπŸ“ OneEncoder: A Unified Text & Image & Video Model

OneEncoder is a lightweight framework for cross-modal alignment that efficiently integrates text, image, and video (with future extensions to other modalities). Unlike traditional methods that rely on massive modality-specific encoders, OneEncoder progressively aligns new data types, making it cost-effective and performant even on small paired datasets.
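The progressive-alignment idea can be sketched as follows. This is a minimal illustration, not the released implementation: the frozen feature extractors, the shared projection, and the per-modality adapter are assumptions based on the description above, and the random features stand in for real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen modality-specific extractors (stand-ins: random features).
text_feat = rng.normal(size=(4, 768))   # e.g. features from a text backbone
image_feat = rng.normal(size=(4, 768))  # e.g. features from a vision backbone

# One shared lightweight projection maps every modality into the
# common embedding space (assumed form of the "unified" module).
W_shared = rng.normal(size=(768, 256)) * 0.02

def project(feat, W=W_shared):
    z = feat @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-norm embeddings

text_emb = project(text_feat)
image_emb = project(image_feat)

# Adding a new modality (e.g. video) only needs a small adapter in front of
# the frozen shared projection -- no retraining of the existing encoders.
A_video = rng.normal(size=(768, 768)) * 0.02
video_feat = rng.normal(size=(4, 768))
video_emb = project(video_feat @ A_video)
```

The point of the sketch: when a modality is added, only the small adapter (`A_video` here) would be trained, which is what keeps the framework lightweight.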

πŸš€ Key Features

βœ… Multimodal Alignment: Supports text, image, and video, with planned extensions to other modalities.
βœ… Lightweight & Efficient: Avoids full retraining when adding new modalities.
βœ… Superior Performance: Outperforms models that require large specialized datasets.

🎯 Applications

  • Visual Question Answering (VQA)
  • Image-Text-X-ray Retrieval
  • Multimodal Content Understanding

πŸ“„ Research Paper

πŸ“œ arXiv: OneEncoder: Progressive Cross-Modal Alignment

πŸ“Œ Resources

πŸ”— GitHub Repo: OneEncoder
πŸš€ Hugging Face Demo: OneEncoder Retriever
πŸ““ Demo Notebook: OneEncoder Demos
πŸ”Š OneEncoder for Text, Image: HF Model
πŸ”Š OneEncoder for Text, Image & Audio: HF Model
πŸ”Š OneEncoder for Text, Image & X-ray: HF Model

πŸ“ Authors

πŸ“Œ Bilal FAYE, Hanane AZZAG, Mustapha LEBBAH, Djamel BOUCHAFFRA

Note: This model was trained with temperature=2.5 and addition as the fusion operation.
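The two hyperparameters in the note can be made concrete with a short sketch. This assumes a CLIP-style symmetric InfoNCE objective over temperature-scaled cosine similarities and element-wise addition as the fusion step; both are plausible readings of the note, not the released training code.

```python
import numpy as np

TEMPERATURE = 2.5  # value stated in the note above

def fuse(modality_tokens, type_embedding):
    # "Addition" fusion: combine modality features with another
    # representation by element-wise addition (assumed form).
    return modality_tokens + type_embedding

def contrastive_loss(emb_a, emb_b, temperature=TEMPERATURE):
    """Symmetric InfoNCE over temperature-scaled cosine similarities."""
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = emb_a @ emb_b.T / temperature
    n = logits.shape[0]
    idx = np.arange(n)
    # Cross-entropy with matched pairs on the diagonal, in both directions.
    log_sm_a = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_b = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(log_sm_a[idx, idx].mean() + log_sm_b[idx, idx].mean()) / 2

rng = np.random.default_rng(2)
a = rng.normal(size=(8, 64))
b = rng.normal(size=(8, 64))
loss_random = contrastive_loss(a, b)   # unmatched pairs: high loss
loss_aligned = contrastive_loss(a, a)  # perfectly matched pairs: lower loss
```

A higher temperature like 2.5 softens the similarity distribution, which can stabilize training on small paired datasets.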

Model size: 285M params (Safetensors, F32)

Model: bilalfaye/OneEncoder-text-image-video (fine-tuned)