OneEncoder: A Unified Text & Image Model
OneEncoder is a lightweight framework for cross-modal alignment, focusing on efficiently integrating text and images (with future extensions to other modalities). Unlike traditional methods relying on massive modality-specific encoders, OneEncoder progressively aligns different data types, making it cost-effective and performant even on small paired datasets.
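The core idea, keeping large modality-specific encoders frozen and training only a small shared projection to align their features, can be sketched as follows. This is a minimal illustration, not the OneEncoder implementation; all array shapes and the random features standing in for encoder outputs are assumptions for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical frozen encoder outputs for a batch of 4 paired samples:
# text features (e.g. from BERT) and image features (e.g. from a ViT),
# with different dimensionalities.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 768))
image_feats = rng.normal(size=(4, 1024))

# Small per-modality projections map both into a shared space; only
# these matrices would be trained, the large encoders stay frozen.
W_text = rng.normal(size=(768, 256)) * 0.02
W_image = rng.normal(size=(1024, 256)) * 0.02

z_text = l2_normalize(text_feats @ W_text)
z_image = l2_normalize(image_feats @ W_image)

# Pairwise cosine similarities: alignment training pulls the diagonal
# (matched pairs) up and pushes the off-diagonal entries down.
sim = z_text @ z_image.T
print(sim.shape)  # (4, 4)
```

Because only the projection layers are trained, adding a new modality means adding one small projection rather than retraining the whole stack, which is what keeps the framework cheap to extend.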
Key Features
- Multimodal Alignment: Initially supports text & image, with extension to other modalities.
- Lightweight & Efficient: Avoids full retraining when adding new modalities.
- Superior Performance: Outperforms models that require large specialized datasets.
Applications
- Visual Question Answering (VQA)
- Image-Text Retrieval
- Multimodal Content Understanding
Research Paper
- arXiv: OneEncoder: Progressive Cross-Modal Alignment
Resources
- GitHub Repo: OneEncoder
- Hugging Face Demo: OneEncoder Retriever
- Demo Notebook: OneEncoder Demos
- OneEncoder for Text, Image with temperature=2.5: HF Model
- OneEncoder for Text, Image & Audio: HF Model
- OneEncoder for Text, Image & Video: HF Model
- OneEncoder for Text, Image & X-ray: HF Model
Authors
Bilal FAYE, Hanane AZZAG, Mustapha LEBBAH, Djamel BOUCHAFFRA
Note: This model was trained with temperature=1.0 and uses addition as the fusion operation.
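The note above mentions two training choices: a contrastive temperature and addition as the fusion operation. A hedged sketch of how these typically enter a CLIP-style setup is shown below; the batch size, embedding size, and random embeddings are illustrative assumptions, not values from the model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Assumed batch of 3 paired, L2-normalized embeddings per modality.
rng = np.random.default_rng(1)
z_text = rng.normal(size=(3, 8))
z_image = rng.normal(size=(3, 8))
z_text /= np.linalg.norm(z_text, axis=1, keepdims=True)
z_image /= np.linalg.norm(z_image, axis=1, keepdims=True)

temperature = 1.0  # the card's stated training temperature
logits = (z_text @ z_image.T) / temperature  # lower T sharpens logits

# Symmetric InfoNCE-style loss: matched pairs sit on the diagonal.
labels = np.arange(3)
loss_t2i = -np.log(softmax(logits, axis=1)[labels, labels]).mean()
loss_i2t = -np.log(softmax(logits, axis=0)[labels, labels]).mean()
loss = (loss_t2i + loss_i2t) / 2

# "Addition" fusion: combine the two modalities by element-wise sum.
fused = z_text + z_image
print(fused.shape)  # (3, 8)
```

Addition fusion keeps the fused representation in the same dimensionality as each modality, which is one reason it pairs well with a lightweight shared projection.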
Model tree for bilalfaye/OneEncoder
- Base model: google-bert/bert-base-uncased